ICRA 2026

Latent Action Diffusion for Cross-Embodiment Manipulation

Erik Bauer, Elvis Nava, Robert K. Katzschmann

AI summary

Co-training a single diffusion policy in a unified latent action space significantly improves manipulation success rates across different robot end-effectors.

Cross-embodiment learning, latent action space, diffusion policy, contrastive learning, skill transfer, robotic manipulation

Problem

End-to-end robotic learning is hindered by data scarcity and the 'embodiment gap,' where diverse action spaces across different robot end-effectors prevent effective cross-embodiment learning and skill transfer.

Approach

The authors learn a semantically aligned latent action space using contrastive loss on retargeted end-effector poses, then train a single embodiment-agnostic diffusion policy in this shared space alongside embodiment-specific decoders.
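To make the contrastive alignment step concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired latents from two end-effectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the latent dimension, the batch size, and the fixed temperature of 0.1 are all hypothetical, and the random arrays stand in for encoder outputs on retargeted poses.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss over paired latents from two embodiments.

    Assumption: z_a[i] and z_b[i] encode the same retargeted pose
    (a positive pair); every other row in the batch is a negative.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature  # (batch, batch) similarity matrix
    # Positives sit on the diagonal; average both matching directions.
    loss_ab = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_ba = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (loss_ab + loss_ba)

# Stand-in latents (hypothetical): 8 paired poses in a 16-dim latent space.
rng = np.random.default_rng(0)
z_hand = rng.normal(size=(8, 16))
z_gripper_aligned = z_hand + 0.01 * rng.normal(size=(8, 16))
z_gripper_shuffled = z_gripper_aligned[::-1]  # reversing breaks every pairing

aligned_loss = symmetric_info_nce(z_hand, z_gripper_aligned)
shuffled_loss = symmetric_info_nce(z_hand, z_gripper_shuffled)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is low when matching poses from the two end-effectors land close together in the latent space and high when the pairing is broken, which is the property a shared, semantically aligned action space needs.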

Key results

Unified diverse end-effector action spaces into a single semantically aligned latent space via contrastive learning
Enabled multi-robot control and skill transfer using a single diffusion policy across substantially different morphologies
Achieved up to 25.3% improvement in manipulation success rates compared to single-embodiment policies
Validated architecture through ablation studies showing temperature annealing and encoder fine-tuning are critical for alignment
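The ablation result above mentions temperature annealing for the contrastive loss. The summary does not state the schedule, so the following is only a sketch of one common choice: a geometric decay from a high to a low temperature. The function name, the direction of the schedule, and the default values `tau_start=1.0` and `tau_end=0.05` are assumptions for illustration, not values from the paper.

```python
def annealed_temperature(step, total_steps, tau_start=1.0, tau_end=0.05):
    """Geometric interpolation of the contrastive temperature.

    Hypothetical schedule: begins at tau_start (soft similarities) and
    decays to tau_end (sharp similarities) by the final training step.
    """
    # Clamp progress to [0, 1] so steps past the end hold tau_end.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac

# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [annealed_temperature(s, 1000) for s in (0, 250, 500, 750, 1000)]
print([round(t, 4) for t in schedule])
```

A high initial temperature keeps the similarity distribution soft while the encoders are still poorly aligned, and lowering it later sharpens the separation between positive and negative pairs.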

Why it matters

Reduces the need for extensive per-robot data collection and accelerates scalable, efficient skill transfer across diverse robotic platforms.

Abstract

End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 25.3% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.

Index terms

Deep Learning in Grasping and Manipulation, Imitation Learning, Representation Learning