Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang
AI summary
Problem
Learning functional dexterous grasping is hindered by the scarcity of large-scale, high-quality grasp datasets and the lack of integrated semantic and geometric reasoning in existing models, making it difficult to generalize to unseen objects.
Approach
The framework uses a three-stage data engine to generate diverse training grasps from a single human video via 2D-3D correspondence transfer and physics-informed optimization, paired with a multimodal network that fuses RGB and geometric features to predict grasps for novel objects.
Key results
- Generates 11 million grasp-image pairs for 900 objects across 9 categories from a single demo
- Achieves 69% success rate on unseen real-world objects
- Outperforms state-of-the-art baselines in simulation and real-world experiments
- Enables robust category-level generalization to novel objects with large shape variations
Why it matters
It provides a scalable, low-data pathway for robots to master complex tool-use and manipulation tasks, significantly advancing practical dexterous manipulation.
Abstract
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local–global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines. For additional results and videos, please visit https://cordex-manipulation.github.io.
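The "transfer" step of the data engine can be pictured with a small sketch. It is an illustration only and not the CorDex implementation: it assumes per-point feature descriptors are already available for both the demonstrated object and a generated object, and it matches hand contact points by nearest neighbor in cosine similarity. The function name `transfer_contacts` and all of its parameters are hypothetical, and the abstract does not specify how correspondences are actually estimated.

```python
import numpy as np

def transfer_contacts(src_feats, tgt_feats, tgt_points, contact_idx):
    """Map hand contact points from a source object to a target object.

    src_feats:   (N, D) descriptors on the demonstrated object (assumed given)
    tgt_feats:   (M, D) descriptors on the generated object (assumed given)
    tgt_points:  (M, 3) coordinates of the target object's points
    contact_idx: indices of source points touched by the hand
    """
    # Normalize descriptors so the dot product equals cosine similarity.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)

    # Similarity between each contact descriptor and every target descriptor.
    sim = src[contact_idx] @ tgt.T  # shape (K, M)

    # Pick the best-matching target point for each contact.
    match = sim.argmax(axis=1)
    return tgt_points[match], match

# Toy example: the target descriptors are a permutation of the source ones.
src_feats = np.eye(4)
perm = [2, 0, 3, 1]
tgt_feats = src_feats[perm]
tgt_points = np.arange(12, dtype=float).reshape(4, 3)
points, match = transfer_contacts(src_feats, tgt_feats, tgt_points, [0, 3])
```

In a pipeline like the one described, transferred contacts of this kind would only be an initialization. The abstract states that the grasp is subsequently adapted through optimization.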