ICRA 2026

SVP: Improving Vision-Language-Action Models with Dual Stochastic Visual Prompting

Zhide Zhong, Haodong Yan, Tianran Zhang, Lujia Wang, Jin Wu, Jun Ma, Xinhu ZHENG, Haoang Li

AI summary

A training-only visual prompting technique mitigates shortcut learning and improves the robustness and success rates of Vision-Language-Action models without architectural changes.

Vision-Language-Action models, shortcut learning, visual prompting, robotic policy learning, attention stabilization, data-centric training

Problem

Vision-Language-Action models often suffer from distracted attention during fine-tuning, where they latch onto spurious background correlations instead of following language instructions, hindering robust policy learning.

Approach

Dual Stochastic Visual Prompting (SVP) acts as a training-only visual scaffold by probabilistically applying and randomly intensifying spotlight prompts on target objects, forcing the model to internalize robust, instruction-grounded attention.
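The two sources of randomness described above can be illustrated with a minimal sketch. It assumes the spotlight is produced by dimming every pixel outside a target-object mask; the paper's actual prompt rendering is not specified here, and the function name `stochastic_spotlight`, the parameters `p_apply` and `max_dim`, and their default values are hypothetical.

```python
import numpy as np


def stochastic_spotlight(image, target_mask, rng, p_apply=0.5, max_dim=0.8):
    """Return a training image with a randomly applied, randomly strong spotlight.

    image:       float array of shape (H, W, 3) with values in [0, 1]
    target_mask: bool array of shape (H, W), True on the target object
    p_apply, max_dim: hypothetical hyperparameters, not values from the paper
    """
    # First source of randomness: decide whether this sample is prompted at all.
    if rng.random() >= p_apply:
        return image.copy()
    # Second source of randomness: how strongly the off-target region is dimmed.
    dim = rng.uniform(0.0, max_dim)
    background = image * (1.0 - dim)
    # Keep target pixels unchanged and replace the rest with the dimmed copy.
    return np.where(target_mask[..., None], image, background)


rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 0.5)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # toy target region
prompted = stochastic_spotlight(image, mask, rng, p_apply=1.0)
unprompted = stochastic_spotlight(image, mask, rng, p_apply=0.0)
```

Because the prompt is training-only, a function like this would be called inside the training data pipeline and skipped at evaluation time, so the policy sees unmodified images when deployed.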

Key results

Boosts standard OpenVLA success rate by 3.7% on LIBERO
Delivers an 8.2% absolute gain on the challenging LIBERO-10 benchmark
Improves the optimized OpenVLA-OFT variant by 0.9% average success rate
Validates consistent real-world robotic manipulation improvements over baselines

Why it matters

Demonstrates that data-centric training interventions can unlock substantial performance gains in generalist robot policies without architectural overhead, offering a practical path for robust real-world deployment.

Abstract

Vision-Language-Action (VLA) models, such as OpenVLA, hold the promise of generalist robots, yet their performance is often impaired by distracted attention, which we identify as a manifestation of shortcut learning. We posit that the solution lies not in architectural modifications, but in a new training paradigm centered on visual prompts that provide explicit visual guidance to the model. We introduce Dual Stochastic Visual Prompting (SVP) as a concrete realization of this paradigm. SVP functions as a training-only “visual scaffold”, a non-invasive mechanism that requires no architectural modifications. Our work demonstrates that this data-centric training paradigm is a highly effective strategy for mitigating distracted attention, enabling the learning of more robust and capable policies without architectural overhead. SVP yields substantial gains on the challenging LIBERO benchmark and real robot experiments. It improves the absolute success rate of the standard OpenVLA by 8.2% on long-horizon tasks and enhances the performance of the highly optimized OpenVLA-OFT. These improvements are validated on a real robot, where our model consistently outperforms baselines across a variety of manipulation tasks.

*Zhide Zhong and Haodong Yan contributed equally to this work. Corresponding author: Haoang Li (haoangli@hkust-gz.edu.cn).

1 The Hong Kong University of Science and Technology (Guangzhou); 2 University of Science and Technology Beijing.

This work was supported in part by the Natural Science Foundation of China under Grant 62403401, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2026A1515012323 and Grant 2024A1515011992, in part by the Guangdong Provincial Project under Grant 2024QN11X127, and in part by the AI Research and Learning Base of Urban Culture under Grant 2023WZJD008.

Index terms

Imitation Learning; Representation Learning; Machine Learning for Robot Control