Seeing Space and Motion: Enhancing Latent Actions with Geometric and Dynamic Awareness for Vision-Language-Action Models
Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang
AI summary
Problem
Existing latent action models suffer from poor spatial understanding due to texture-biased encoders and limited temporal perception from sparse frame inputs, leading to unstable and ambiguous action representations.
Approach
The authors introduce Farsighted-LAM, which uses DINOv2 features for geometrically consistent spatial encoding and processes consecutive frames to capture dynamic motion patterns. This is integrated into SSM-VLA, an end-to-end framework that explicitly predicts future visual states via a visual chain-of-thought module before generating actions.
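The order of operations described here (encode consecutive frames, infer a latent action, predict a future visual state, then produce an action) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`encode_frame`, `latent_action`, `predict_future`, `action_head`, `policy_step`) is hypothetical, frames are stood in for by short lists of floats, and the frozen DINOv2 encoder and learned modules are replaced by simple arithmetic so the data flow is easy to follow.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class PolicyStep:
    latent_action: Vector     # motion summary inferred from the clip
    predicted_future: Vector  # imagined future state (the "visual chain-of-thought" step)
    action: Vector            # command produced only after the prediction


def encode_frame(frame: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a frozen pretrained encoder such as DINOv2."""
    # Assumption: a "frame" is already a flat feature list, so we only copy it.
    return [float(x) for x in frame]


def latent_action(features: Sequence[Vector]) -> Vector:
    """Toy latent action: mean feature change between consecutive frames."""
    if len(features) < 2:
        raise ValueError("need at least two consecutive frames")
    dim = len(features[0])
    deltas = [[b[i] - a[i] for i in range(dim)]
              for a, b in zip(features, features[1:])]
    return [sum(d[i] for d in deltas) / len(deltas) for i in range(dim)]


def predict_future(current: Vector, motion: Vector, horizon: int) -> Vector:
    """Toy future-state predictor: extrapolate the current features linearly."""
    return [c + horizon * m for c, m in zip(current, motion)]


def action_head(current: Vector, future: Vector) -> Vector:
    """Toy action head: move toward the predicted future, clipped to [-1, 1]."""
    return [max(-1.0, min(1.0, f - c)) for c, f in zip(current, future)]


def policy_step(clip: Sequence[Sequence[float]], horizon: int = 2) -> PolicyStep:
    """Run one predict-then-act step over a clip of consecutive frames."""
    features = [encode_frame(frame) for frame in clip]
    motion = latent_action(features)
    future = predict_future(features[-1], motion, horizon)
    action = action_head(features[-1], future)
    return PolicyStep(motion, future, action)


if __name__ == "__main__":
    clip = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]
    print(policy_step(clip))
```

The point of the sketch is only the ordering: the action is computed from the predicted future state rather than directly from the observations, which is what makes the intermediate prediction inspectable.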
Key results
- Farsighted-LAM framework with geometry-aware spatial encoding and multi-scale temporal modeling
- SSM-VLA end-to-end VLA policy integrating visual chain-of-thought reasoning
- State-of-the-art performance on the CALVIN ABC-D simulation benchmark
- Successful zero-shot generalization and real-world robotic manipulation validation
Why it matters
Provides a more robust and interpretable foundation for embodied AI agents tackling complex, long-horizon manipulation tasks in both simulated and real-world environments.
Abstract
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are temporally distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
1 AMAP, Alibaba Group.
2 Tsinghua Shenzhen International Graduate School, Tsinghua University.
3 School of Software Engineering, Xi’an Jiaotong University.
* This work was conducted during the internship at Alibaba Group.
† Corresponding author: ruqihuang@sz.tsinghua.edu.cn
‡ Project leader.