ICRA 2026

One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation

Juncheng Mu, Sizhe Yang, Hojin Bae, Feiyu Jia, Qingwei Ben, Boyi Li, Huazhe Xu, Jiangmiao Pang

AI summary

OPFA enables a single policy to co-train across diverse grippers and dexterous hands using a unified geometry-aware latent space, drastically improving data efficiency and cross-embodiment generalization.

cross-embodiment manipulation, latent action representation, dexterous hands, geometry-aware encoding, unified decoder, few-shot learning

Problem

Cross-embodiment manipulation is hindered by drastic differences in action spaces and structural disparities between end-effectors, making joint training difficult and requiring costly, embodiment-specific data collection for new robots.

Approach

OPFA learns a Geometry-Aware Latent Representation (GaLR) from 3D point clouds of end-effector states using 3D convolutions and transformers, then uses a unified latent retargeting decoder to recover embodiment-specific actions without per-embodiment tuning.
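The shared-latent idea described above can be illustrated with a minimal sketch. This is not the paper's architecture: a per-point linear layer with max pooling stands in for the 3D convolutions and transformers, the weights are random rather than trained, and the per-joint query vectors used by the decoder are a hypothetical mechanism chosen only to show how one set of decoder weights could serve end-effectors with different degrees of freedom. The names `LATENT_DIM`, `QUERY_DIM`, `encode`, and `decode` are all assumptions introduced for illustration.

```python
import numpy as np

LATENT_DIM = 16  # hypothetical latent size, not taken from the paper
QUERY_DIM = 4    # hypothetical per-joint descriptor size

rng = np.random.default_rng(0)
# Shared (untrained, random) weights: the same parameters serve every embodiment.
W_point = rng.normal(size=(3, LATENT_DIM))
W_dec = rng.normal(size=(LATENT_DIM + QUERY_DIM, 1))


def encode(points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) end-effector point cloud to a fixed-size latent.

    A per-point linear layer with ReLU followed by max pooling over points,
    so the output shape does not depend on N or on the point ordering.
    """
    features = np.maximum(points @ W_point, 0.0)  # (N, LATENT_DIM)
    return features.max(axis=0)                   # (LATENT_DIM,)


def decode(latent: np.ndarray, joint_queries: np.ndarray) -> np.ndarray:
    """Decode one action value per joint from the shared latent.

    joint_queries has shape (dof, QUERY_DIM): one hypothetical descriptor
    per joint. The decoder weights are identical for every embodiment.
    """
    dof = joint_queries.shape[0]
    tiled = np.tile(latent, (dof, 1))                        # (dof, LATENT_DIM)
    inputs = np.concatenate([tiled, joint_queries], axis=1)  # (dof, LATENT_DIM + QUERY_DIM)
    return np.tanh(inputs @ W_dec).ravel()                   # (dof,)


# Random stand-ins for two very different end-effectors.
gripper_cloud = rng.normal(size=(64, 3))   # e.g. a 1-DoF parallel gripper
hand_cloud = rng.normal(size=(256, 3))     # e.g. a 16-DoF dexterous hand

z_gripper = encode(gripper_cloud)
z_hand = encode(hand_cloud)

gripper_action = decode(z_gripper, rng.normal(size=(1, QUERY_DIM)))
hand_action = decode(z_hand, rng.normal(size=(16, QUERY_DIM)))

print(z_gripper.shape, z_hand.shape, gripper_action.shape, hand_action.shape)
```

Both point clouds land in a latent of the same size even though they have different numbers of points, and the same decoder weights emit 1 action value for the gripper and 16 for the hand. A policy trained in such a space would only ever predict the fixed-size latent, which is what makes co-training on mixed-embodiment data possible in principle.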

Key results

Constructs a Geometry-Aware Latent Representation (GaLR) to unify action dimensions across diverse end-effectors without manual annotation
Enables end-to-end cross-embodiment co-training with a unified decoder that requires no embodiment-specific tuning
Cross-embodiment co-training improves success rates by more than 50% compared to single-source training
With only 8 demonstrations from a new embodiment, achieves performance comparable to a well-trained model with 72 demonstrations

Why it matters

It provides a scalable, data-efficient framework for training universal manipulation policies across diverse robot hardware, accelerating real-world deployment and reducing data collection costs.

Abstract

Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy across multiple embodiments. We first learn a Geometry-Aware Latent Representation (GaLR), which leverages 3D convolution networks and transformers to build a shared latent action space across different embodiments. Then we design a unified latent retargeting decoder that extracts embodiment-specific actions from the latent representations, without any embodiment-specific decoder tuning. OPFA enables end-to-end co-training of data from diverse embodiments, including various grippers and dexterous hands with arbitrary degrees of freedom, significantly improving data efficiency and reducing the cost of skill transfer. We conduct extensive experiments across 11 different end-effectors. The results demonstrate that OPFA significantly improves policy performance in diverse settings by leveraging heterogeneous embodiment data. For instance, cross-embodiment co-training can improve success rates by more than 50% compared to single-source training. Moreover, by adding only a few demonstrations from a new embodiment (e.g., eight), OPFA can achieve performance comparable to that of a well-trained model with 72 demonstrations.

Index terms

Grippers and Other End-Effectors; Dexterous Manipulation; Multifingered Hands