CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations
Bowen Jiang [University of Texas at Austin], William Reger [University of Texas at Austin], Roberto Martín-Martín [University of Texas at Austin]
AI summary
Problem
Compositional Dexterous Functional Object Manipulation (CD-FOM) requires coordinating high-level semantic understanding with low-level physical dexterity, but existing methods rely on labor-intensive demonstrations or lack the geometric precision needed for complex tool actuation.
Approach
The framework uses vision-language models to extract local and global semantic constraints from task descriptions and scenes, which guide analytic constrained optimization to generate functional grasps and reinforcement learning to refine them into complete grasp-move-actuate policies.
Key results
- 73% average success rate across six unseen CD-FOM tasks in real-world trials
- VLM-guided pose search produces significantly more appropriate task poses than prior methods
- Constraint-guided RL boosts functional success by over 40% compared to analytical grasps alone
- Zero-demonstration learning of physically viable dexterous behaviors for objects with internal mechanisms
Why it matters
Enables robots to autonomously master complex tool-use and functional manipulation without human demonstrations, advancing general-purpose robotic dexterity.
Abstract
In this work, we study Compositional Dexterous Functional Object Manipulation (CD-FOM): tasks such as aiming and actuating a spray bottle on a plant or a glue gun on wood, which require both actuating an object’s internal mechanism and controlling its pose to apply the object’s function to the environment. These tasks pose significant challenges for robots because they demand the integration of semantic understanding (of the object’s function, actuation mode, and application area) with intricate physical dexterity (to manage grasp stability, movement trajectory, and actuation). We introduce CoDex, a zero-demonstration framework that autonomously discovers CD-FOM manipulation strategies. CoDex uses vision–language models (VLMs) to infer semantic constraints from the task and scene. These constraints guide analytic constrained optimization to generate a short list of functional grasp candidates that can be efficiently refined with reinforcement learning to generate full grasp–move–actuate policies transferable from simulation to the real world. We evaluate CoDex on a 7-DoF robot arm with a 16-DoF multi-fingered hand across six CD-FOM tasks involving previously unseen objects with internal mechanisms (spray bottles, hot glue guns, air dusters, flashlights, pepper grinders) and their application to unseen target objects, showcasing its ability to autonomously discover and execute complex, physically viable dexterous behaviors without human demonstrations. More information at our website.
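As a rough illustration of how semantic constraints could narrow a set of grasp candidates to a short list before reinforcement learning, the sketch below filters candidates by constraint violation and ranks the feasible ones by a stability score. It is not CoDex's implementation: the names `GraspCandidate`, `SemanticConstraints`, `violation`, and `shortlist`, the two example constraints (a fingertip near the actuation point, the palm clear of the outlet), and all numbers are hypothetical stand-ins for whatever a VLM and an analytic optimizer would actually produce. The RL refinement stage is not shown.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical grasp description in the object's frame (meters)."""
    name: str
    fingertip: Vec3    # position of the finger meant to actuate the mechanism
    palm_center: Vec3  # position of the palm
    stability: float   # assumed analytic grasp-quality score in [0, 1]


@dataclass(frozen=True)
class SemanticConstraints:
    """Hypothetical constraints, standing in for VLM output."""
    actuation_point: Vec3      # e.g. trigger location (local constraint)
    max_finger_dist: float     # fingertip must be within this distance of it
    outlet_point: Vec3         # e.g. nozzle location (global constraint)
    min_palm_clearance: float  # palm must stay at least this far from it


def violation(c: GraspCandidate, s: SemanticConstraints) -> float:
    """Sum of constraint violations; 0.0 means all constraints are met."""
    reach = math.dist(c.fingertip, s.actuation_point) - s.max_finger_dist
    clearance = s.min_palm_clearance - math.dist(c.palm_center, s.outlet_point)
    return max(0.0, reach) + max(0.0, clearance)


def shortlist(candidates, s, k=2, tol=1e-9):
    """Keep feasible candidates and return the k most stable ones."""
    feasible = [c for c in candidates if violation(c, s) <= tol]
    return sorted(feasible, key=lambda c: c.stability, reverse=True)[:k]


# Toy spray-bottle example with made-up geometry.
constraints = SemanticConstraints(
    actuation_point=(0.0, 0.0, 0.10), max_finger_dist=0.01,
    outlet_point=(0.05, 0.0, 0.15), min_palm_clearance=0.04,
)
candidates = [
    GraspCandidate("index_on_trigger", (0.0, 0.0, 0.10), (0.0, -0.05, 0.05), 0.90),
    GraspCandidate("body_wrap", (0.0, 0.0, 0.02), (0.0, -0.05, 0.00), 0.95),
    GraspCandidate("palm_over_nozzle", (0.0, 0.005, 0.10), (0.05, 0.0, 0.14), 0.80),
    GraspCandidate("middle_on_trigger", (0.0, 0.0, 0.105), (0.0, -0.06, 0.04), 0.70),
]
selected = [c.name for c in shortlist(candidates, constraints)]
print(selected)
```

In this toy setup, the most stable candidate (`body_wrap`) is discarded because its fingertip cannot reach the trigger, which illustrates why a grasp that is stable is not necessarily functional.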