ICRA 2026

Seeing Space and Motion: Enhancing Latent Actions with Geometric and Dynamic Awareness for Vision-Language-Action Models

Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang

AI summary

SSM-VLA enhances robot decision-making by combining geometry-aware spatial encoding, multi-scale temporal modeling, and visual chain-of-thought reasoning to predict future states before acting.

Vision-Language-Action, Latent Action Models, Geometric Awareness, Temporal Modeling, Visual Chain-of-Thought, Embodied AI

Problem

Existing latent action models suffer from poor spatial understanding due to texture-biased encoders and limited temporal perception from sparse frame inputs, leading to unstable and ambiguous action representations.

Approach

The authors introduce Farsighted-LAM, which uses DINOv2 features for geometrically consistent spatial encoding and processes consecutive frames to capture dynamic motion patterns. This is integrated into SSM-VLA, an end-to-end framework that explicitly predicts future visual states via a visual chain-of-thought module before generating actions.
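As a rough illustration of the data flow described above, and not the authors' implementation, the following sketch uses a fixed random projection as a stand-in for frozen DINOv2 features, summarizes motion over a window of consecutive frames at several temporal strides, and predicts a future feature before producing an action. All function names (`encode_frame`, `latent_action`, `predict_future`, `act`), the dimensions, the stride set, and the linear heads are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 64        # flattened toy "image" size (assumption)
FEAT_DIM = 16         # feature size of the stand-in encoder (assumption)
ACTION_DIM = 7        # e.g. a 7-DoF arm command (assumption)
STRIDES = (1, 2, 4)   # temporal scales, chosen for illustration only

# Frozen random projection standing in for a pretrained encoder such as DINOv2.
W_ENC = rng.standard_normal((FRAME_DIM, FEAT_DIM)) / np.sqrt(FRAME_DIM)
# Hypothetical linear heads; a real model would learn these.
W_FUT = rng.standard_normal((FEAT_DIM + len(STRIDES) * FEAT_DIM, FEAT_DIM)) * 0.1
W_ACT = rng.standard_normal((2 * FEAT_DIM, ACTION_DIM)) * 0.1


def encode_frame(frame):
    """Map one flattened frame to a feature vector with the frozen encoder."""
    return frame @ W_ENC


def latent_action(frames):
    """Summarize motion over consecutive frames at several temporal strides.

    For each stride s, average the feature differences f[t+s] - f[t] over the
    window, then concatenate the per-stride summaries.
    """
    feats = np.stack([encode_frame(f) for f in frames])  # (T, FEAT_DIM)
    parts = []
    for s in STRIDES:
        diffs = feats[s:] - feats[:-s]                   # (T - s, FEAT_DIM)
        parts.append(diffs.mean(axis=0))
    return np.concatenate(parts)                         # (len(STRIDES) * FEAT_DIM,)


def predict_future(current_feat, z):
    """Predict a future feature from the current feature and the latent action."""
    return np.concatenate([current_feat, z]) @ W_FUT


def act(frames):
    """Predict the future state first, then derive an action from it."""
    current = encode_frame(frames[-1])
    z = latent_action(frames)
    future = predict_future(current, z)
    action = np.concatenate([current, future]) @ W_ACT
    return future, action


frames = rng.standard_normal((8, FRAME_DIM))  # 8 consecutive toy frames
future, action = act(frames)
print(future.shape, action.shape)
```

The ordering inside `act` is the point of the sketch: the predicted future feature is computed first and then fed to the action head, mirroring the "predict future states before acting" idea. A window of identical frames yields an all-zero latent action here, since every feature difference vanishes.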

Key results

Farsighted-LAM framework with geometry-aware spatial encoding and multi-scale temporal modeling
SSM-VLA end-to-end VLA policy integrating visual chain-of-thought reasoning
State-of-the-art performance on the CALVIN ABC-D simulation benchmark
Successful zero-shot generalization and real-world robotic manipulation validation

Why it matters

Provides a more robust and interpretable foundation for embodied AI agents tackling complex, long-horizon manipulation tasks in both simulated and real-world environments.

Abstract

Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are temporally distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.

1 AMAP, Alibaba Group. 2 Tsinghua Shenzhen International Graduate School, Tsinghua University. 3 School of Software Engineering, Xi’an Jiaotong University. * This work was conducted during the internship at Alibaba Group. † Corresponding author: ruqihuang@sz.tsinghua.edu.cn ‡ Project leader.

Index terms

AI-Based Methods, Deep Learning in Grasping and Manipulation, Deep Learning Methods