Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision

Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event targ...

Source

http://arxiv.org/abs/2606.26801v1