ICRA 2026

Cross-Embodiment Imitation: Learning a Unified Latent Space for Multi-Robot Control

Yashuai Yan, Dongheui Lee

AI summary

A unified latent space enables goal-conditioned control policies trained solely on human data to be deployed directly across diverse humanoid robots without retraining.

latent-space control, cross-embodiment imitation, multi-robot systems, contrastive learning, goal-conditioned policies, humanoid robots

Problem

Controlling diverse humanoid platforms through a unified interface is hindered by vast morphological differences and the need for platform-specific fine-tuning or extensive paired demonstration data.

Approach

The method learns a decoupled, segment-aware latent space via contrastive learning to align human and robot motions, then trains a goal-conditioned policy in this space using only human demonstrations for direct multi-robot deployment.
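As a rough illustration of what a decoupled, segment-aware contrastive objective might look like, the sketch below applies a triplet margin loss separately to each body-part slice of a latent vector. The triplet loss is a stand-in chosen for brevity; the summary does not state the paper's exact objective. The `SEGMENTS` layout, the latent sizes, and the `margin` value are all hypothetical.

```python
import math

# Hypothetical layout: each body segment owns its own slice of the latent vector,
# so alignment is enforced per segment rather than on the whole body at once.
SEGMENTS = {"left_arm": slice(0, 2), "right_arm": slice(2, 4)}


def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_triplet_loss(anchor, positive, negative, margin=0.5):
    """Sum of per-segment triplet margin losses.

    anchor   : latent of a motion from one embodiment (e.g. a human pose)
    positive : latent of a similar motion from another embodiment
    negative : latent of a dissimilar motion
    """
    total = 0.0
    for sl in SEGMENTS.values():
        d_pos = l2(anchor[sl], positive[sl])
        d_neg = l2(anchor[sl], negative[sl])
        # Penalize a segment only when the positive is not closer by at least `margin`.
        total += max(0.0, d_pos - d_neg + margin)
    return total
```

Because each segment contributes its own term, a robot that lacks a segment (for instance a single-arm platform) could in principle skip that slice without affecting the others; that reading is an assumption about why the latent space is decoupled, not a claim from the paper.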

Key results

Decoupled latent space architecture enabling fine-grained motion alignment across diverse morphologies
Modular contrastive learning with tailored similarity metrics for cross-embodiment alignment
Goal-conditioned latent policy enabling accurate multi-robot control without fine-tuning
Scalable new-robot integration via lightweight, robot-specific embedding layer training

Why it matters

Provides a scalable, embodiment-agnostic control framework that allows developers to deploy unified policies across diverse humanoid platforms without retraining.

Abstract

We present a scalable framework for cross-embodiment humanoid robot control by learning a shared latent representation that unifies motion across humans and diverse humanoid platforms, including single-arm, dual-arm, and legged humanoid robots. Our method proceeds in two stages: first, we construct a decoupled latent space that captures localized motion patterns across different body parts using contrastive learning, enabling accurate and flexible motion retargeting even across robots with diverse morphologies. To enhance alignment between embodiments, we introduce tailored similarity metrics that combine joint rotation and end-effector positioning for critical segments, such as arms. Then, we train a goal-conditioned control policy directly within this latent space using only human data. Leveraging a conditional variational autoencoder, our policy learns to predict latent space displacements guided by intended goal directions. We show that the trained policy can be directly deployed on multiple robots without any adaptation. Furthermore, our method supports the efficient addition of new robots to the latent space by learning only a lightweight, robot-specific embedding layer. The learned latent policies can also be directly applied to the new robots. Experimental results demonstrate that our approach enables robust, scalable, and embodiment-agnostic robot control across a wide range of humanoid platforms.
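The abstract describes similarity metrics that combine joint rotation and end-effector positioning for segments such as arms. One plausible form, sketched below under assumptions, is a weighted sum of a mean quaternion geodesic angle and a Euclidean end-effector distance, where a lower value means more similar. The function names, the weights `w_rot` and `w_ee`, and the idea that end-effector positions are pre-normalized (for example by limb length) are illustrative choices, not details taken from the paper.

```python
import math


def rotation_distance(q1, q2):
    """Geodesic angle in radians between two unit quaternions (w, x, y, z)."""
    # abs() treats q and -q as the same rotation; min() guards against rounding above 1.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def segment_distance(rots_a, rots_b, ee_a, ee_b, w_rot=1.0, w_ee=1.0):
    """Hypothetical dissimilarity between two poses of one segment.

    rots_a, rots_b : lists of unit quaternions, one per joint of the segment
    ee_a, ee_b     : end-effector positions, assumed already normalized
    """
    rot_term = sum(rotation_distance(a, b) for a, b in zip(rots_a, rots_b)) / len(rots_a)
    ee_term = math.dist(ee_a, ee_b)
    return w_rot * rot_term + w_ee * ee_term
```

A score like this could be used to decide which human and robot poses count as positives or negatives during contrastive training; how the paper actually selects pairs is not specified in the abstract.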

Index terms

Multi-Robot Systems; Imitation Learning; Machine Learning for Robot Control