ICRA 2026

VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning Via Online Monte Carlo Tree Search

wenkai@e.ntu.edu.sg, ziwei.wang@ntu.edu.sg

AI summary

A plug-in test-time reasoning framework using Monte Carlo Tree Search and a world model significantly boosts the long-horizon success rates of off-the-shelf Vision-Language-Action models in both simulation and real-world robotics.

Vision-Language-Action models, Monte Carlo Tree Search, Test-time scaling, Robotic manipulation, World models, Reasoning

Problem

Current Vision-Language-Action models make short-sighted, next-step predictions that accumulate errors over long-horizon tasks, limiting their reliability in complex robotic manipulation.

Approach

VLA-Reasoner augments any base VLA at test time by running an online Monte Carlo Tree Search that simulates future states with a learned world model, samples actions via Kernel Density Estimation, and scores intermediate states using offline value estimation to correct deviations.
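The search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D state, `world_model`, `value_fn`, `sample_actions`, and every parameter value are hypothetical stand-ins chosen so the example runs on its own. The sketch only shows the general shape of the idea: seed the tree with the base policy's proposed action, imagine successor states with a model, score them with a value estimate instead of rolling out to termination, and return the most visited root action.

```python
import math
import random


class Node:
    def __init__(self, state, action=None, parent=None):
        self.state = state
        self.action = action  # action that led from the parent to this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # Unvisited children are tried first.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def world_model(state, action):
    # Hypothetical stand-in for a learned world model: 1-D position + displacement.
    return state + action


def value_fn(state, goal=1.0):
    # Hypothetical stand-in for an offline value estimate: closer to the goal is better.
    return -abs(goal - state)


def sample_actions(vla_action, rng, k=4, spread=0.2):
    # Hypothetical stand-in for the KDE-based sampler: keep the policy's proposal
    # and add a few perturbed variants. The same proposal is reused at every depth
    # to keep the sketch short.
    return [vla_action] + [vla_action + rng.gauss(0.0, spread) for _ in range(k - 1)]


def mcts_plan(state, vla_action, iterations=200, depth=3, seed=0):
    rng = random.Random(seed)
    root = Node(state)
    for _ in range(iterations):
        node, level = root, 0
        # Selection: follow the highest UCB score down to a leaf.
        while node.children and level < depth:
            node = max(node.children, key=lambda n: n.ucb())
            level += 1
        # Expansion: imagine successor states with the world model.
        if level < depth:
            for a in sample_actions(vla_action, rng):
                node.children.append(Node(world_model(node.state, a), a, node))
            node = rng.choice(node.children)
        # Evaluation: score the imagined state with the value estimate.
        reward = value_fn(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Return the root action that the search visited most often.
    return max(root.children, key=lambda n: n.visits).action
```

Calling `mcts_plan(state=0.0, vla_action=0.3)` returns one displacement to execute; in a deployment loop the search would be rerun from the newly observed state at every step.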

Key results

Increases average success rates by 5% to 9.8% across LIBERO and SimplerEnv benchmarks
Boosts performance of multiple base VLAs to state-of-the-art levels without retraining
Achieves higher real-world success rates on challenging manipulation tasks with minimal data
Provides a plug-and-play module that corrects incremental deviations during deployment

Why it matters

Demonstrates that lightweight test-time computation and structured reasoning can reliably extend generalist robot policies to complex, long-horizon tasks without costly retraining.

Abstract

Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and struggle with long-horizon trajectory tasks due to incremental deviations. To address this problem, we propose a plug-in framework named VLA-Reasoner that effectively empowers off-the-shelf VLAs with the capability of foreseeing future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out possible action trajectories, where the actions involved serve as rationales for generating future states via a world model. This enables VLA-Reasoner to foresee and reason about potential outcomes and to search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where step-wise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline value estimation strategy to score predicted futures and correct deviations with long-term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that our proposed VLA-Reasoner achieves significant improvements over state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation for robotic manipulation. The project website is available at: https://vla-reasoner.github.io/.
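To make the KDE-based sampling idea concrete, here is a small sketch under stated assumptions. The abstract does not say what data the density is fitted on or how confidence is used, so the reference action set, the 1-D action space, the fixed bandwidth, and the function names `kde_density`, `kde_sample`, and `propose_actions` are all hypothetical. The sketch only demonstrates the general technique: fit a Gaussian KDE to a handful of reference actions, draw new candidates from it without calling the policy again, and rank the candidates by estimated density as a confidence proxy.

```python
import math
import random


def kde_density(x, samples, bandwidth):
    # 1-D Gaussian KDE: average of Gaussian kernels centered on the samples.
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def kde_sample(samples, bandwidth, n, rng):
    # Drawing from a Gaussian KDE is equivalent to picking a kernel center
    # uniformly at random and adding Gaussian noise with the kernel bandwidth.
    return [rng.choice(samples) + rng.gauss(0.0, bandwidth) for _ in range(n)]


def propose_actions(samples, n=5, bandwidth=0.05, seed=0):
    # Returns (density, action) pairs sorted from most to least confident.
    rng = random.Random(seed)
    candidates = kde_sample(samples, bandwidth, n, rng)
    scored = [(kde_density(a, samples, bandwidth), a) for a in candidates]
    scored.sort(reverse=True)
    return scored
```

For example, `propose_actions([0.10, 0.12, 0.30])` yields five candidate displacements, with those near the cluster around 0.1 tending to rank above those near the isolated sample at 0.3. A tree search could expand the highest-density candidates first.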

Index terms

Imitation Learning, Deep Learning Methods, Manipulation Planning