ICRA 2026

Toward Embodiment Equivariant Vision-Language-Action Policy

Anzhe Chen, Yifei Yang, Zhenjie Zhu, Kechun Xu, Zhongxiang Zhou, Rong Xiong, Yue Wang

AI summary

Enforcing configuration equivariance in action space design enables vision-language-action policies to generalize to novel robot embodiments with minimal fine-tuning.

Vision-language-action, embodiment equivariance, cross-embodiment generalization, action space design, robot policy learning, geometry-aware networks

Problem

Existing vision-language-action policies fail to generalize to novel robot configurations after pre-training, requiring costly adaptation due to action spaces that overfit to specific embodiments.

Approach

The authors formulate cross-embodiment learning as a configuration-equivariance problem, introducing an equivariant action decoder and a geometry-aware network that preserves spatial cues without breaking invariance.
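As a rough illustration of what an equivariance property means here, the sketch below checks that a toy policy commutes with a transformation of the scene. It is not the authors' decoder or network: `toy_policy`, `rotate_z`, and `equivariance_gap` are hypothetical names, and treating a "configuration transformation" as a rotation of the robot base frame about the vertical axis is an assumption made only for this example.

```python
import math


def rotate_z(v, theta):
    """Rotate a 3D vector about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def toy_policy(ee_pos, target_pos, step=0.05):
    """Hypothetical policy: move the end effector a fixed step toward the target.

    The action is a translation expressed in the same frame as the inputs.
    """
    d = tuple(t - e for t, e in zip(target_pos, ee_pos))
    norm = math.sqrt(sum(c * c for c in d))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(step * c / norm for c in d)


def equivariance_gap(ee_pos, target_pos, theta):
    """Largest coordinate difference between g.policy(x) and policy(g.x).

    Here g is the assumed frame change (rotation about z). A gap near zero
    means the policy is equivariant to g for this input.
    """
    transformed_output = rotate_z(toy_policy(ee_pos, target_pos), theta)
    output_of_transformed = toy_policy(
        rotate_z(ee_pos, theta), rotate_z(target_pos, theta)
    )
    return max(abs(p - q) for p, q in zip(transformed_output, output_of_transformed))


gap = equivariance_gap((0.3, 0.0, 0.2), (0.5, 0.1, 0.0), math.radians(40))
print(f"equivariance gap: {gap:.2e}")
```

The toy policy passes this check because it depends only on relative geometry. A policy that mapped raw base-frame coordinates to actions through an unconstrained function would generally not, which is the kind of embodiment-specific dependence the summary attributes to conventional action spaces.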

Key results

Theoretical framework unifying action space design across embodiments via equivariance
Equivariant action decoder robust to camera calibration errors
Geometry-aware architecture enhancing embodiment-agnostic spatial reasoning
96.4% average success on LIBERO with only 5 GPU hours of fine-tuning and 94% zero-shot success on a novel Fanuc robot

Why it matters

Reduces the computational and data burden of adapting large-scale robot policies to new hardware, accelerating deployment across diverse robotic platforms.

Abstract

Vision-language-action policies learn manipulation skills across tasks, environments and embodiments through large-scale pre-training. However, their ability to generalize to novel robot configurations remains limited. Most approaches emphasize model size, dataset scale and diversity while paying less attention to the design of action spaces. This leads to the configuration generalization problem, which requires costly adaptation. We address this challenge by formulating cross-embodiment pre-training as designing policies equivariant to embodiment configuration transformations. Building on this principle, we propose a framework that (i) establishes an embodiment equivariance theory for action space and policy design, (ii) introduces an action decoder that enforces configuration equivariance, and (iii) incorporates a geometry-aware network architecture to enhance embodiment-agnostic spatial reasoning. Extensive experiments in both simulation and real-world settings demonstrate that our approach improves pre-training effectiveness and enables efficient fine-tuning on novel robot embodiments. Our code is available at https://github.com/hhcaz/e2vla

Index terms

Service Robotics, Domestic Robotics, Imitation Learning