ICRA 2026

CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations

Bowen Jiang [University of Texas at Austin], William Reger [University of Texas at Austin], Roberto Martín-Martín [University of Texas at Austin]

AI summary

CoDex autonomously learns complex dexterous tool-use for unseen objects without demonstrations, achieving 73% real-world success by bridging vision-language reasoning with constrained optimization and reinforcement learning.

dexterous manipulation, functional object manipulation, vision-language models, reinforcement learning, zero-demonstration learning, constrained optimization

Problem

Compositional Dexterous Functional Object Manipulation requires coordinating high-level semantic understanding with low-level physical dexterity, but existing methods rely on labor-intensive demonstrations or lack the geometric precision needed for complex tool actuation.

Approach

The framework uses vision-language models to extract local and global semantic constraints from task descriptions and scenes, which guide analytic constrained optimization to generate functional grasps and reinforcement learning to refine them into complete grasp-move-actuate policies.
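The constraint-to-candidate step can be illustrated with a toy sketch. This is not the authors' implementation: it swaps the analytic constrained optimization for a simple penalty-based ranking, and every name in it (`Grasp`, `constraint_cost`, `shortlist`, the trigger position, the target direction, the weights) is a hypothetical stand-in. It assumes one "local" constraint (an actuating fingertip should sit on the trigger) and one "global" constraint (the tool's functional axis should point at the target), then keeps the k lowest-cost candidates as the short list that a later RL stage would refine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Grasp:
    """Hypothetical grasp candidate (illustrative only, not from the paper)."""
    fingertip: tuple   # (x, y, z) of the actuating fingertip
    nozzle_dir: tuple  # direction of the tool's functional axis


def _dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _angle(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def constraint_cost(g, trigger_pos, target_dir, w_local=1.0, w_global=1.0):
    """Weighted penalty: fingertip-to-trigger distance plus aiming error."""
    local_penalty = _dist(g.fingertip, trigger_pos)     # assumed local constraint
    global_penalty = _angle(g.nozzle_dir, target_dir)   # assumed global constraint
    return w_local * local_penalty + w_global * global_penalty


def shortlist(candidates, trigger_pos, target_dir, k=3):
    """Return the k candidates with the lowest constraint cost, best first."""
    ranked = sorted(
        candidates,
        key=lambda g: constraint_cost(g, trigger_pos, target_dir),
    )
    return ranked[:k]
```

Calling `shortlist(candidates, trigger_pos, target_dir, k=3)` on a list of sampled `Grasp` objects returns the three candidates that best satisfy both assumed constraints. In a real system the penalties, weights, and candidate parameterization would come from the VLM output and the hand model rather than being fixed by hand.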

Key results

73% average success rate across six unseen CD-FOM tasks in real-world trials
VLM-guided pose search produces significantly more appropriate task poses than prior methods
Constraint-guided RL boosts functional success by over 40% compared to analytical grasps alone
Zero-demonstration learning of physically viable dexterous behaviors for objects with internal mechanisms

Why it matters

Enables robots to autonomously master complex tool-use and functional manipulation without human demonstrations, advancing general-purpose robotic dexterity.

Abstract

In this work, we study Compositional Dexterous Functional Object Manipulation (CD-FOM): tasks such as aiming and actuating a spray bottle on a plant or a glue gun on wood, which require both actuating an object’s internal mechanism and controlling its pose to apply the object’s function to the environment. These tasks pose significant challenges for robots because they demand the integration of semantic understanding (of the object’s function, actuation mode, and application area) with intricate physical dexterity (to manage grasp stability, movement trajectory, and actuation). We introduce CoDex, a zero-demonstration framework that autonomously discovers CD-FOM manipulation strategies. CoDex uses vision-language models (VLMs) to infer semantic constraints from the task and scene. These constraints guide analytic constrained optimization to generate a short list of functional grasp candidates, which are then efficiently refined with reinforcement learning into full grasp-move-actuate policies transferable from simulation to the real world. We evaluate CoDex on a 7-DoF robot arm with a 16-DoF multi-fingered hand across six CD-FOM tasks involving previously unseen objects with internal mechanisms (spray bottles, hot glue guns, air dusters, flashlights, pepper grinders) and their application to unseen target objects, showcasing its ability to autonomously discover and execute complex, physically viable dexterous behaviors without human demonstrations. More information at our website.

Index terms

Dexterous Manipulation, AI-Enabled Robotics, Reinforcement Learning