GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To...

Source

http://arxiv.org/abs/2605.22812v1