Toward Embodiment Equivariant Vision-Language-Action Policy
Anzhe Chen, Yifei Yang, Zhenjie Zhu, Kechun Xu, Zhongxiang Zhou, Rong Xiong, Yue Wang
AI summary
Problem
Existing vision-language-action policies fail to generalize to novel robot configurations after pre-training, requiring costly adaptation due to action spaces that overfit to specific embodiments.
Approach
The authors formulate cross-embodiment learning as a configuration-equivariance problem, introducing an equivariant action decoder and a geometry-aware network that preserves spatial cues without breaking invariance.
Key results
- Theoretical framework unifying action space design across embodiments via equivariance
- Equivariant action decoder robust to camera calibration errors
- Geometry-aware architecture enhancing embodiment-agnostic spatial reasoning
- 96.4% average success on LIBERO with only 5 GPU hours of fine-tuning and 94% zero-shot success on a novel Fanuc robot
Why it matters
Reduces the computational and data burden of adapting large-scale robot policies to new hardware, accelerating deployment across diverse robotic platforms.
Abstract
Vision-language-action policies learn manipulation skills across tasks, environments and embodiments through large-scale pre-training. However, their ability to generalize to novel robot configurations remains limited. Most approaches emphasize model size, dataset scale and diversity while paying less attention to the design of action spaces. This leads to the configuration generalization problem, which requires costly adaptation. We address this challenge by formulating cross-embodiment pre-training as designing policies equivariant to embodiment configuration transformations. Building on this principle, we propose a framework that (i) establishes an embodiment equivariance theory for action space and policy design, (ii) introduces an action decoder that enforces configuration equivariance, and (iii) incorporates a geometry-aware network architecture to enhance embodiment-agnostic spatial reasoning. Extensive experiments in both simulation and real-world settings demonstrate that our approach improves pre-training effectiveness and enables efficient fine-tuning on novel robot embodiments. Our code is available at https://github.com/hhcaz/e2vla
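To make the notion of configuration equivariance concrete, the following is a minimal toy sketch, not the decoder proposed in the paper. It assumes that a policy predicts an end-effector displacement in the camera frame and that a hypothetical `decode_action` function re-expresses it in the robot base frame using a known rotation. All names, frames, and numbers here are illustrative assumptions. The check verifies that re-orienting the base frame by a rotation `g` transforms the decoded action by the same `g`, i.e. decode(g · config, z) = g · decode(config, z).

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def decode_action(R_base_from_cam, delta_cam):
    """Hypothetical decoder: express a camera-frame displacement in the base frame."""
    return R_base_from_cam @ delta_cam


# Assumed toy setup: one camera-to-base rotation and one predicted displacement.
R_base_from_cam = rot_z(0.3)
delta_cam = np.array([0.05, -0.02, 0.10])

# Configuration change: g maps old-base coordinates to new-base coordinates,
# so the camera-to-base rotation for the new configuration is g @ R_base_from_cam.
g = rot_z(np.pi / 2)
R_new_base_from_cam = g @ R_base_from_cam

a_old = decode_action(R_base_from_cam, delta_cam)
a_new = decode_action(R_new_base_from_cam, delta_cam)

# Equivariance check: the action in the new configuration equals g applied
# to the action in the old configuration.
equivariant = np.allclose(a_new, g @ a_old)
print("equivariant:", equivariant)
```

In this sketch the property holds by construction because the decoder is linear in the rotation. A learned decoder would have to enforce the same relation through its architecture or action parameterization, which is the design problem the abstract describes.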