InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
Ji Zhang, Shihan Wu, Xu Luo, Hao Wu, Junlin Xie, Lianli Gao, Heng Tao Shen, Jingkuan Song
AI summary
Problem
Existing Vision-Language-Action (VLA) models rely on spurious correlations between task-irrelevant visual features and actions, which severely limits their ability to generalize to novel scenarios.
Approach
InSpire prepends a simple spatial reasoning question to the language instruction and jointly trains the model to predict both the object's direction relative to the robot and the actions. It works as a plugin for existing autoregressive VLAs.
Key results
- Up to 10% absolute success rate gain on unseen LIBERO tasks
- A 1B-parameter model outperforms larger reasoning-based VLAs
- 25–26% average success rate improvement on real-world seen and unseen tasks
- Requires no auxiliary data or external models, functioning as a lightweight plugin
Why it matters
Provides a simple, data-efficient way to make general-purpose robotic systems more robust and reliable in complex, novel environments.
Abstract
Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA’s attention to task-relevant factors by prepending the question “In which direction is the [object] relative to the robot?” to the language instruction and aligning the model’s output answer “right/left/up/down/front/back/grasped” and predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of InSpire.
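The prompt construction described in the abstract can be illustrated with a minimal sketch. Only the question template and the seven-word answer vocabulary come from the abstract; the function names, the example instruction, and the choice to join question and instruction with a single space are assumptions made for illustration and may differ from the authors' implementation.

```python
# Illustrative sketch only. The question template and answer vocabulary are
# quoted from the abstract; all names and the separator are hypothetical.
DIRECTIONS = ("right", "left", "up", "down", "front", "back", "grasped")
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"


def build_prompt(instruction: str, obj: str) -> str:
    """Prepend the spatial reasoning question to the language instruction."""
    question = QUESTION_TEMPLATE.format(obj=obj)
    # Assumption: question and instruction are joined by one space.
    return f"{question} {instruction}"


def check_answer(answer: str) -> str:
    """Return the answer if it is in the fixed vocabulary, else raise."""
    if answer not in DIRECTIONS:
        raise ValueError(f"unknown direction: {answer!r}")
    return answer


if __name__ == "__main__":
    # Hypothetical instruction and target object.
    print(build_prompt("pick up the black bowl", "black bowl"))
```

During training, the model's answer to the question and its predicted actions would both be supervised against ground truth, as the abstract states; how the two objectives are combined is not specified here, so the sketch stops at the prompt.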