Learning End-To-End Dexterous Arm-Hand VLA Policies with Shared Autonomy: DexGrasp AI Copilot for Efficient Teleoperation
Yu Cui, Yujian Zhang, Lina Tao, Yang Li, Xinyu Yi, Zhibin (Alex) Li
AI summary
Problem
Training effective Vision-Language-Action (VLA) models for dexterous manipulation requires large-scale, high-quality demonstration data, but fully manual teleoperation overloads human operators while automated planning lacks naturalness and diversity.
Approach
The authors propose a shared autonomy system in which a human operator teleoperates the robotic arm via VR while an autonomous DexGrasp-VLA policy acts as an AI copilot, generating force-adaptive grasping actions for a five-finger hand. This division of labor drastically reduces the operator's cognitive load and enables efficient collection of high-quality arm-hand coordination data.
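The division of labor described above can be illustrated with a minimal sketch. It is not the authors' implementation: the proportional rule in `copilot_hand_action` is a hypothetical stand-in for the learned DexGrasp-VLA policy, and the names `Step`, `shared_autonomy_step`, `ARM_DOF`, `HAND_DOF`, `target_force`, and `gain` are all assumptions introduced for illustration. The sketch only shows how one demonstration step could pair an operator-supplied arm command with an autonomously generated hand command.

```python
from dataclasses import dataclass
from typing import List, Sequence

ARM_DOF = 7   # assumed arm command size; not stated in the source
HAND_DOF = 5  # assumed: one closing command per finger


@dataclass
class Step:
    """One recorded demonstration step."""
    arm_action: List[float]   # supplied by the human operator (VR teleoperation)
    hand_action: List[float]  # supplied by the autonomous copilot


def copilot_hand_action(tactile: Sequence[float],
                        target_force: float = 1.0,
                        gain: float = 0.5) -> List[float]:
    """Hypothetical placeholder for the learned copilot policy.

    Each finger closes in proportion to how far its measured contact
    force is below the target, and opens if the force is above it.
    Commands are clamped to [-1, 1].
    """
    return [max(-1.0, min(1.0, gain * (target_force - f))) for f in tactile]


def shared_autonomy_step(operator_arm_cmd: Sequence[float],
                         tactile: Sequence[float]) -> Step:
    """Pair the operator's arm command with the copilot's hand command."""
    if len(operator_arm_cmd) != ARM_DOF:
        raise ValueError(f"expected {ARM_DOF} arm values")
    if len(tactile) != HAND_DOF:
        raise ValueError(f"expected {HAND_DOF} tactile readings")
    return Step(list(operator_arm_cmd), copilot_hand_action(tactile))


if __name__ == "__main__":
    step = shared_autonomy_step([0.0] * ARM_DOF, [0.0, 1.0, 2.0, 3.0, 4.0])
    print(step.hand_action)
```

In this sketch the operator never specifies finger commands; each recorded step still contains a full arm-hand action, which is the property that makes the collected data usable for training a joint arm-hand policy.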
Key results
- Shared autonomy framework reduces cognitive load and enables efficient high-quality data collection
- Arm-Hand Feature Enhancement module captures distinct macro- and micro-movement dynamics
- Corrective human-in-the-loop teleoperation enables continuous policy refinement via failure recovery
- End-to-end VLA policy achieves ~90% success rate on over 50 diverse objects
Why it matters
Enables scalable, high-fidelity data collection and training for complex dexterous manipulation, advancing general-purpose humanoid robots and AI-driven teleoperation systems.
Abstract
Achieving human-like dexterous manipulation is essential for general-purpose robots but remains a challenge. Recent advances in Vision-Language-Action (VLA) models offer the potential to learn flexible skills from demonstration data. However, training effective VLAs requires a large amount of high-quality data, which is difficult to obtain: fully manual teleoperation cognitively overloads human operators, while automated planning produces unnatural motions and lacks data diversity. We present a Shared Autonomy framework: a human operator teleoperates the arm for global motion, while an autonomous DexGrasp-VLA policy, acting as an AI Copilot, generates force-adaptive actions for a five-finger hand with tactile feedback, drastically reducing human effort and enabling efficient collection of high-quality demonstrations. Using these data, we train an end-to-end VLA policy with a novel Arm-Hand Feature Enhancement module, in which shared representations are combined with separate arm and hand latent features that represent the distinct dynamics of macro and micro movements, leading to more robust and natural coordination of arm-hand motions. Our Corrective Teleoperation can further refine the policy with failure-recovery demonstrations obtained via human intervention. Experiments show that our approach efficiently generates high-quality data and learns policies with a high success rate and natural behaviors. The trained arm-hand VLA policy generalizes effectively to both seen and unseen objects, with a success rate of around 90% across more than 50 diverse objects.
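The abstract does not specify how the Arm-Hand Feature Enhancement module joins shared and branch-specific features. The following minimal sketch assumes the simplest reading: a shared feature vector is projected into separate arm and hand latents, and each action head receives the shared features concatenated with its own latent. Everything here is hypothetical, including the class name `ArmHandEnhancement`, all dimensions, the `tanh` nonlinearity, and the random untrained weights; it illustrates the two-branch structure only, not the authors' architecture.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def linear(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init_weights(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ArmHandEnhancement:
    """Hypothetical two-branch head over a shared representation."""

    def __init__(self, shared_dim: int = 8, latent_dim: int = 4,
                 arm_dof: int = 7, hand_dof: int = 5, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.shared_dim = shared_dim
        # Separate projections give each branch its own latent features.
        self.to_arm_latent = init_weights(latent_dim, shared_dim, rng)
        self.to_hand_latent = init_weights(latent_dim, shared_dim, rng)
        # Each head reads the shared features plus its branch latent.
        self.arm_head = init_weights(arm_dof, shared_dim + latent_dim, rng)
        self.hand_head = init_weights(hand_dof, shared_dim + latent_dim, rng)

    def forward(self, shared: Sequence[float]) -> Tuple[List[float], List[float]]:
        if len(shared) != self.shared_dim:
            raise ValueError(f"expected {self.shared_dim} shared features")
        arm_latent = [math.tanh(v) for v in linear(self.to_arm_latent, shared)]
        hand_latent = [math.tanh(v) for v in linear(self.to_hand_latent, shared)]
        arm_action = linear(self.arm_head, list(shared) + arm_latent)
        hand_action = linear(self.hand_head, list(shared) + hand_latent)
        return arm_action, hand_action


if __name__ == "__main__":
    model = ArmHandEnhancement()
    arm, hand = model.forward([0.1] * 8)
    print(len(arm), len(hand))
```

Keeping the two latents separate lets the arm branch and the hand branch specialize, for example in large reaching motions versus fine finger adjustments, while both still see the same shared context.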