GIVE: Grounding Human Gestures in Vision-Language-Action Models

Human communication is inherently multimodal: language is often accompanied by non-verbal cues such as gestures that help convey intent. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a purely text-driven task, overlooking the important role of gestures in Human-...

Source

http://arxiv.org/abs/2606.13435v1