ICRA 2026

Cross-Embodiment Transfer Via Behavior-Aligned Representations

Ajay Sridhar, Jensen Gao, Jonathan Yang, Jean Mercat, Suneel Belkhale, Dorsa Sadigh

AI summary

End-effector traces and behavior-aligned representations significantly boost cross-embodiment transfer in vision-language-action models.

Cross-embodiment transfer, Vision-language-action models, Behavior-aligned representations, Robot imitation learning, Sim-to-real transfer, RoboCasa-X

Problem

Cross-embodiment transfer remains difficult due to mismatched observations and action spaces across robot platforms, hindering the use of large-scale heterogeneous datasets.

Approach

The method trains vision-language-action models to implicitly align diverse robot data by jointly predicting behavior-aligned representations alongside actions, evaluated on a new simulation benchmark and real robots.
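The joint objective described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a mean-squared-error loss on both outputs, uses a flattened end-effector trace as the only representation, and introduces a hypothetical weighting factor `repr_weight`. The `Sample` fields and the function names are invented for the example. Samples without action labels contribute only to the representation term, which is one plausible way action-free data could enter training.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """One training example. All field names are hypothetical."""
    pred_action: Sequence[float]    # action predicted by the policy
    pred_trace: Sequence[float]     # predicted end-effector trace (flattened)
    target_trace: Sequence[float]   # ground-truth trace for this sample
    target_action: Optional[Sequence[float]] = None  # None for action-free data


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean(values: Sequence[float]) -> float:
    """Average of a list, defined as 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def joint_loss(batch: Sequence[Sample], repr_weight: float = 1.0) -> float:
    """Action loss plus a weighted representation loss (assumed form).

    Action-free samples are skipped in the action term but still
    supervise the representation term.
    """
    action_terms = [
        mse(s.pred_action, s.target_action)
        for s in batch
        if s.target_action is not None
    ]
    trace_terms = [mse(s.pred_trace, s.target_trace) for s in batch]
    return mean(action_terms) + repr_weight * mean(trace_terms)


if __name__ == "__main__":
    batch = [
        # Sample with action labels (e.g. robot demonstration data).
        Sample([1.0, 1.0], [0.0, 0.0], [0.0, 2.0], target_action=[0.0, 1.0]),
        # Action-free sample: only the trace is supervised.
        Sample([5.0, 5.0], [1.0, 1.0], [1.0, 1.0]),
    ]
    print(joint_loss(batch, repr_weight=0.5))
```

In this sketch the representation output only shapes the training signal, so a deployed policy would read the action output alone.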

Key results

End-effector traces yield the strongest transfer gains
Representation benefits scale with larger prior datasets
Inference-time representation prediction is unnecessary
Sim-to-real transfer improves task progress by 28%

Why it matters

Provides a scalable pathway for generalist robot policies to leverage heterogeneous data across evolving hardware platforms.

Abstract

Recent progress in large-scale imitation learning for robot manipulation has been driven by leveraging datasets across a wide range of robot embodiments. However, achieving significant cross-embodiment transfer is often still challenging. In this work, we study the role of using behavior-aligned representations (e.g., object bounding boxes, language motions, end-effector traces of robot motion) in vision-language-action (VLA) models to promote cross-embodiment transfer. We hypothesize that by possessing invariances across embodiments while being predictive of robot actions, these representations can help unify large-scale cross-embodiment data to enhance transfer. To assess our hypothesis, we develop a simulation-based benchmark designed to assess transfer with diverse cross-embodiment data to new embodiments. Using this benchmark, we compare different representations and ways of incorporating them. We identify that end-effector traces can be particularly beneficial for transfer, representations are generally more useful with larger prior datasets, and can be used to benefit from action-free data. We also demonstrate that they can enhance sim-to-real cross-embodiment transfer, improving task completion progress of real robot policies pre-trained on simulation data by 28%. We provide videos of our evaluations at our website https://ajaysridhar.com/barx/.

Index terms

Transfer Learning, Big Data in Robotics and Automation, Imitation Learning