SVP: Improving Vision-Language-Action Models with Dual Stochastic Visual Prompting
Zhide Zhong, Haodong Yan, Tianran Zhang, Lujia Wang, Jin Wu, Jun Ma, Xinhu ZHENG, Haoang Li
AI summary
Problem
Vision-Language-Action models often suffer from distracted attention during fine-tuning, where they latch onto spurious background correlations instead of following language instructions, hindering robust policy learning.
Approach
Dual Stochastic Visual Prompting (SVP) acts as a training-only visual scaffold by probabilistically applying and randomly intensifying spotlight prompts on target objects, forcing the model to internalize robust, instruction-grounded attention.
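The two sources of randomness described above (whether a prompt is applied at all, and how strong it is) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_spotlight`, the parameters `p_apply` and `max_dim`, and the choice to realize a "spotlight" by dimming pixels outside a target-object mask are all assumptions made for illustration. The sketch also assumes a target mask is available during training.

```python
import numpy as np

def stochastic_spotlight(image, mask, rng, p_apply=0.5, max_dim=0.7):
    """Hypothetical training-time augmentation (illustrative only).

    image:   HxWx3 uint8 array.
    mask:    HxW boolean array, True on the target object.
    p_apply: probability of applying the prompt (assumed value).
    max_dim: upper bound on background dimming (assumed value).
    """
    # First stochastic choice: skip the prompt entirely on some samples,
    # so the model cannot rely on the prompt always being present.
    if rng.random() >= p_apply:
        return image.copy()

    # Second stochastic choice: sample how strongly to dim the background.
    strength = rng.uniform(0.0, max_dim)

    # Keep target pixels unchanged; scale everything else down.
    scale = np.where(mask, 1.0, 1.0 - strength)[..., None]
    out = image.astype(np.float32) * scale
    return np.clip(out, 0, 255).astype(np.uint8)


# Usage: prompts are applied during training only; at inference the
# policy would receive the raw image.
rng = np.random.default_rng(0)
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True
prompted = stochastic_spotlight(frame, target, rng, p_apply=1.0)
```

Because the prompt is sometimes absent and sometimes faint, a sketch like this encourages the policy to locate the instructed object on its own rather than depend on the visual cue, which is consistent with the "training-only scaffold" framing.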
Key results
- Boosts standard OpenVLA success rate by 3.7% on LIBERO
- Delivers an 8.2% absolute gain on the challenging LIBERO-10 benchmark
- Improves the optimized OpenVLA-OFT variant by 0.9% average success rate
- Validates consistent real-world robotic manipulation improvements over baselines
Why it matters
Demonstrates that data-centric training interventions can unlock substantial performance gains in generalist robot policies without architectural overhead, offering a practical path for robust real-world deployment.
Abstract
Vision-Language-Action (VLA) models, such as OpenVLA, hold the promise of generalist robots, yet their performance is often impaired by distracted attention, which we identify as a manifestation of shortcut learning. We posit that the solution lies not in architectural modifications, but in a new training paradigm centered on visual prompts that provide explicit visual guidance to the model. We introduce Dual Stochastic Visual Prompting (SVP) as a concrete realization of this paradigm. SVP functions as a training-only “visual scaffold”, a non-invasive mechanism that requires no architectural modifications. Our work demonstrates that this data-centric training paradigm is a highly effective strategy for mitigating distracted attention, enabling the learning of more robust and capable policies without architectural overhead. SVP yields substantial gains on the challenging LIBERO benchmark and real robot experiments. It improves the absolute success rate of the standard OpenVLA by 8.2% on long-horizon tasks and enhances the performance of the highly optimized OpenVLA-OFT. These improvements are validated on a real robot, where our model consistently outperforms baselines across a variety of manipulation tasks.
This work was supported in part by the Natural Science Foundation of China under Grant 62403401, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2026A1515012323 and Grant 2024A1515011992, in part by the Guangdong Provincial Project under Grant 2024QN11X127, and in part by the AI Research and Learning Base of Urban Culture under Grant 2023WZJD008.
*Zhide Zhong and Haodong Yan contributed equally to this work.
Corresponding Author: Haoang Li (haoangli@hkust-gz.edu.cn)
1 The Hong Kong University of Science and Technology (Guangzhou)
2 University of Science and Technology Beijing