ICRA 2026

Learning End-To-End Dexterous Arm-Hand VLA Policies with Shared Autonomy: DexGrasp AI Copilot for Efficient Teleoperation

Yu Cui, Yujian Zhang, Lina Tao, Yang Li, Xinyu Yi, Zhibin (Alex) Li

AI summary

A shared autonomy framework, in which a human teleoperates the arm in VR while an autonomous AI copilot controls the hand, enables efficient data collection for training an end-to-end VLA policy that achieves a success rate of around 90% across more than 50 diverse objects.

Dexterous manipulation, Vision-Language-Action models, Shared autonomy, AI copilot, Tactile feedback, Human-in-the-loop teleoperation

Problem

Training effective Vision-Language-Action (VLA) models for dexterous manipulation requires large-scale, high-quality demonstration data, but fully manual teleoperation cognitively overloads human operators, while automated planning produces unnatural motions and lacks data diversity.

Approach

The authors propose a shared autonomy system in which a human operator teleoperates the robotic arm via VR for global motion, while an autonomous DexGrasp-VLA policy acts as an AI copilot that generates force-adaptive grasping actions for a five-finger hand using tactile feedback. This division of labor drastically reduces the operator's cognitive load and enables efficient collection of high-quality arm-hand coordination data.
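As a minimal sketch of how one timestep of shared-autonomy data collection might be assembled, the Python below merges the operator's arm command with the copilot's hand command into a single joint action. Everything here is an assumption for illustration: the names `Frame`, `copilot_hand_action`, and `shared_autonomy_step`, the joint counts `ARM_DOF` and `HAND_DOF`, and the `force_limit` threshold are hypothetical, and the rule that stops closing a joint once its contact force reaches a limit is a toy stand-in for the learned, force-adaptive DexGrasp-VLA copilot, not the authors' method.

```python
from dataclasses import dataclass

import numpy as np

ARM_DOF = 7   # hypothetical: arm joint targets retargeted from the VR controller
HAND_DOF = 6  # hypothetical: actuated joints of the five-finger hand


@dataclass
class Frame:
    """One recorded timestep of a shared-autonomy demonstration (hypothetical schema)."""
    arm_action: np.ndarray   # commanded by the human operator via VR
    hand_action: np.ndarray  # commanded by the AI copilot
    tactile: np.ndarray      # per-joint contact-force readings

    @property
    def action(self) -> np.ndarray:
        # Joint arm-hand action vector used as the end-to-end training target.
        return np.concatenate([self.arm_action, self.hand_action])


def copilot_hand_action(proposed, current, tactile, force_limit: float) -> np.ndarray:
    """Toy force-adaptive rule (a stand-in for the learned DexGrasp-VLA copilot):
    a joint whose contact force has reached force_limit holds its current position;
    every other joint moves to the proposed target."""
    proposed = np.asarray(proposed, dtype=float)
    current = np.asarray(current, dtype=float)
    tactile = np.asarray(tactile, dtype=float)
    return np.where(tactile >= force_limit, current, proposed)


def shared_autonomy_step(vr_arm_cmd, copilot_proposal, hand_state, tactile,
                         force_limit: float = 2.0) -> Frame:
    """Combine the human arm command and the copilot hand command into one Frame."""
    hand = copilot_hand_action(copilot_proposal, hand_state, tactile, force_limit)
    return Frame(arm_action=np.asarray(vr_arm_cmd, dtype=float),
                 hand_action=hand,
                 tactile=np.asarray(tactile, dtype=float))


if __name__ == "__main__":
    frame = shared_autonomy_step(
        vr_arm_cmd=np.zeros(ARM_DOF),
        copilot_proposal=np.full(HAND_DOF, 1.0),  # copilot wants to close every joint
        hand_state=np.full(HAND_DOF, 0.4),
        tactile=[0.0, 3.0, 0.5, 2.5, 0.0, 0.0],   # joints 1 and 3 are already in firm contact
    )
    print(frame.hand_action.tolist())  # [1.0, 0.4, 1.0, 0.4, 1.0, 1.0]
    print(frame.action.shape)          # (13,)
```

Logging the concatenated `Frame.action` is what would let a single end-to-end policy later learn arm and hand together, assuming both streams are recorded at the same rate.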

Key results

Shared autonomy framework reduces cognitive load and enables efficient high-quality data collection
Arm-Hand Feature Enhancement module captures distinct macro- and micro-movement dynamics
Corrective human-in-the-loop teleoperation refines the policy with failure-recovery demonstrations collected via human intervention
End-to-end VLA policy achieves a success rate of around 90% across more than 50 diverse seen and unseen objects
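The corrective, human-in-the-loop step can be illustrated with a human-gated rollout loop in the style of HG-DAgger, swapped in here as an assumption because the summary does not specify the intervention protocol. The function `corrective_rollout` and the callables `policy`, `human`, `wants_control`, and `step_env` are hypothetical placeholders: the policy acts until the operator takes over, and only the operator's corrective (observation, action) pairs are kept as failure-recovery data.

```python
from typing import Callable, List, Tuple

import numpy as np

Obs = np.ndarray
Action = np.ndarray


def corrective_rollout(
    policy: Callable[[Obs], Action],
    human: Callable[[Obs], Action],
    wants_control: Callable[[Obs], bool],
    step_env: Callable[[Obs, Action], Obs],
    obs: Obs,
    horizon: int,
) -> List[Tuple[Obs, Action]]:
    """Run the policy, letting a human take over whenever wants_control(obs) is true.

    Returns only the (observation, human action) pairs recorded while the human
    was in control; these are the samples appended to the training set.
    """
    corrections = []
    for _ in range(horizon):
        if wants_control(obs):
            action = human(obs)                            # operator overrides the policy
            corrections.append((obs.copy(), action.copy()))
        else:
            action = policy(obs)                           # autonomous step, not recorded
        obs = step_env(obs, action)
    return corrections


if __name__ == "__main__":
    # Toy 1-D task: the policy keeps drifting past 2.0 and the human pulls it back.
    data = corrective_rollout(
        policy=lambda o: np.array([1.0]),
        human=lambda o: np.array([-1.0]),
        wants_control=lambda o: o[0] > 2.0,
        step_env=lambda o, a: o + a,
        obs=np.array([0.0]),
        horizon=6,
    )
    print(len(data))  # 2
```

Keeping only the intervention segments focuses new data on the states where the policy was failing, which is the usual motivation for human-gated correction.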

Why it matters

Enables scalable, high-fidelity data collection and training for complex dexterous manipulation, advancing general-purpose humanoid robots and AI-driven teleoperation systems.

Abstract

Achieving human-like dexterous manipulation is essential for general-purpose robots but remains a challenge. Recent advances in Vision-Language-Action (VLA) models offer the potential to learn flexible skills from demonstration data. However, training effective VLAs requires a large amount of high-quality data, which is difficult to obtain: fully manual teleoperation cognitively overloads human operators, while automated planning produces unnatural motions and lacks data diversity. We present a Shared Autonomy framework in which a human operator teleoperates the arm for global motion, while an autonomous DexGrasp-VLA policy, acting as an AI Copilot, generates force-adaptive actions for a five-finger hand with tactile feedback. This drastically reduces human effort and enables efficient collection of high-quality demonstrations. Using these data, we train an end-to-end VLA policy with a novel Arm-Hand Feature Enhancement module, in which shared representations are combined with separate arm and hand latent features that represent the distinct dynamics of macro and micro movements, leading to more robust and natural coordination of arm-hand motions. Our Corrective Teleoperation can further refine the policy with failure-recovery demonstrations collected via human intervention. Experiments show that our approach efficiently generates high-quality data and learns policies with a high success rate and natural behaviors. The trained arm-hand VLA policy generalizes effectively to both seen and unseen objects, achieving a success rate of around 90% across more than 50 diverse objects.
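A minimal NumPy sketch of the Arm-Hand Feature Enhancement idea follows, reading "shared representations combined with separate arm and hand latent features" as concatenation. The class name `ArmHandFeatureEnhancement`, the feature sizes `D_SHARED` and `D_BRANCH`, the action sizes, and the single-layer ReLU branches are all hypothetical, and random weights stand in for trained parameters. The sketch only shows the data flow: each action head sees the shared backbone feature plus a latent specific to its own effector, so arm and hand outputs can specialize to macro and micro movements.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SHARED, D_BRANCH = 64, 32  # hypothetical feature sizes
ARM_DOF, HAND_DOF = 7, 6     # hypothetical action sizes


def linear(d_in: int, d_out: int):
    # Random weights stand in for trained parameters.
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out)), np.zeros(d_out)


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


class ArmHandFeatureEnhancement:
    """Hypothetical sketch: the shared feature is concatenated with an
    arm-specific or hand-specific latent before separate action heads."""

    def __init__(self):
        self.arm_proj = linear(D_SHARED, D_BRANCH)    # arm (macro-movement) branch
        self.hand_proj = linear(D_SHARED, D_BRANCH)   # hand (micro-movement) branch
        self.arm_head = linear(D_SHARED + D_BRANCH, ARM_DOF)
        self.hand_head = linear(D_SHARED + D_BRANCH, HAND_DOF)

    @staticmethod
    def _apply(layer, x: np.ndarray) -> np.ndarray:
        w, b = layer
        return x @ w + b

    def __call__(self, shared: np.ndarray) -> np.ndarray:
        # shared: (batch, D_SHARED) features from the VLA backbone
        z_arm = relu(self._apply(self.arm_proj, shared))
        z_hand = relu(self._apply(self.hand_proj, shared))
        arm = self._apply(self.arm_head, np.concatenate([shared, z_arm], axis=-1))
        hand = self._apply(self.hand_head, np.concatenate([shared, z_hand], axis=-1))
        return np.concatenate([arm, hand], axis=-1)  # (batch, ARM_DOF + HAND_DOF)


if __name__ == "__main__":
    out = ArmHandFeatureEnhancement()(rng.normal(size=(4, D_SHARED)))
    print(out.shape)  # (4, 13)
```

Because the hand branch never feeds the arm head (and vice versa), each effector's latent can specialize independently while both still share the perception and language features.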

Index terms

Learning from Demonstration, Imitation Learning, Dexterous Manipulation