ICRA 2026

InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

Ji Zhang, Shihan Wu, Xu Luo, Hao Wu, Junlin Xie, Lianli Gao, Heng Tao Shen, Jingkuan Song

AI summary

InSpire improves VLA generalization by adding a lightweight spatial reasoning step that redirects attention to task-relevant objects, mitigating the effect of spurious correlations without extra data or models.

Vision-Language-Action models, spatial reasoning, spurious correlations, robotic manipulation, generalization, plug-and-play

Problem

Existing Vision-Language-Action (VLA) models rely on spurious correlations between task-irrelevant visual features and actions, which severely limits their ability to generalize to novel scenarios.

Approach

InSpire prepends a simple spatial reasoning question to the language instruction and jointly trains the model to predict both the direction of the target object relative to the robot and the actions. It works as a plugin for existing autoregressive VLAs.
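
As a rough illustration of the prompt format, the sketch below prepends the question template quoted in the abstract to an instruction and builds a joint target of one direction answer followed by action tokens. The function names, the string-token representation of actions, and the single-space separator are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# Answer vocabulary and question template as quoted in the abstract.
DIRECTIONS = ("right", "left", "up", "down", "front", "back", "grasped")
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"


def build_prompt(obj: str, instruction: str) -> str:
    """Prepend the spatial reasoning question to the language instruction.

    Hypothetical helper: joining with a single space is an assumption.
    """
    question = QUESTION_TEMPLATE.format(obj=obj)
    return f"{question} {instruction}"


def build_target(direction: str, action_tokens: List[str]) -> List[str]:
    """Build the supervised target: the direction answer, then the actions.

    Hypothetical helper: actions are shown as placeholder string tokens;
    a real VLA would use its own action tokenization.
    """
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    return [direction] + list(action_tokens)


if __name__ == "__main__":
    print(build_prompt("mug", "pick up the mug"))
    print(build_target("left", ["<a1>", "<a2>"]))
```

Placing the answer ahead of the action tokens means an autoregressive model conditions its actions on its own spatial answer, which is one plausible reading of how the question redirects attention to the task-relevant object.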

Key results

Up to 10% absolute success rate gain on unseen LIBERO tasks
With a 1B-parameter model, outperforms larger reasoning-based VLAs
25–26% average success rate improvement on real-world seen and unseen tasks
Requires no auxiliary data or external models, functioning as a lightweight plugin

Why it matters

Provides a simple, data-efficient way to make general-purpose robotic systems more robust and reliable in complex, novel environments.

Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA’s attention to task-relevant factors by prepending the question “In which direction is the [object] relative to the robot?” to the language instruction and aligning the model’s output answer “right/left/up/down/front/back/grasped” and predicted actions with ground-truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of InSpire.
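
One way to read "aligning the model's output answer and predicted actions with ground-truth" for an autoregressive VLA is as a single next-token objective over the answer followed by the actions. The formula below is a sketch under that assumption. The symbols (observation $o$, question $q$, instruction $\ell$, direction answer $d$, action tokens $a_1, \dots, a_T$, model parameters $\theta$) and the equal weighting of the two terms are introduced here for illustration and are not taken from the paper.

```latex
\mathcal{L}(\theta)
  = -\log p_\theta\!\left(d \mid o, q, \ell\right)
    \;-\; \sum_{t=1}^{T} \log p_\theta\!\left(a_t \mid o, q, \ell, d, a_{<t}\right)
```

The first term supervises the spatial answer and the second supervises the actions conditioned on that answer.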

Index terms

Transfer Learning, Learning from Demonstration, Representation Learning