ICRA 2026

Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization

Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas

AI summary

Explicitly scoring action plans by their distance-to-goal advantage and aggregating multiple future trajectories improves VLM robotic policy success rate by 24.6% over state-of-the-art baselines, while a confidence-based early exit cuts inference time by 56.5%.

Vision-Language Models, Robotic Manipulation, Test-Time Scaling, Reflective Planning, Value-Guided Decoding, Multi-Path Search

Problem

Existing VLM-based robotic planners struggle with complex physical reasoning and long-horizon planning due to inefficient implicit value learning, reliance on single greedy futures, and high inference latency.

Approach

The method decouples state evaluation from action generation. A critic explicitly scores the advantage of an action plan as its reduction in distance to the goal. Beam search explores multiple future paths, which are aggregated during decoding. A lightweight confidence-based trigger exits early when the direct prediction is reliable, so reflection runs only when necessary.
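
As a rough illustration of two ideas named above, the sketch below computes an advantage as a reduction in distance to the goal and gates reflection on prediction confidence. It is not the paper's implementation: the function names, the use of the top action probability as the confidence measure, and the 0.8 threshold are all assumptions made for this example, and the distances are passed in as plain numbers rather than estimated by a learned critic.

```python
from typing import Sequence


def advantage(dist_before: float, dist_after: float) -> float:
    """Advantage of an action plan, taken as the reduction in distance to the goal.

    Positive values mean the plan is expected to move the scene closer to the goal.
    (Hypothetical helper; in the paper a learned critic supplies the estimate.)
    """
    return dist_before - dist_after


def should_reflect(action_probs: Sequence[float], threshold: float = 0.8) -> bool:
    """Return True when the direct prediction is not confident enough.

    Confidence is assumed here to be the highest action probability;
    the 0.8 threshold is an arbitrary illustrative value.
    """
    return max(action_probs) < threshold
```

Under these assumptions, a policy that assigns 0.9 to its top action would exit early, while one split evenly between two actions would fall through to the slower multi-path reflection step.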

Key results

24.6% success rate improvement on unseen tasks
56.5% inference time reduction via early exit
Explicit distance-to-goal advantage enables direct supervision
Multi-path beam search aggregation corrects initial proposals

Why it matters

Provides a scalable, efficient framework for deploying VLMs in complex robotic manipulation, bridging high-level reasoning with precise physical control.

Abstract

Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while reducing inference time by 56.5%.
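
The multi-path idea in the abstract can be illustrated with a toy sketch. The code below is an assumption-laden stand-in, not the authors' method: states are integers, `step` and `distance` are hypothetical callables supplied by the caller (standing in for a dynamics model and a critic), and aggregation is a plain mean over the final beam rather than the aggregation-during-decoding the paper describes.

```python
from typing import Callable, Dict, List, Sequence

State = int
Action = int


def score_first_actions(
    state: State,
    actions: Sequence[Action],
    step: Callable[[State, Action], State],   # hypothetical dynamics model
    distance: Callable[[State], float],       # hypothetical distance-to-goal critic
    beam_width: int = 2,
    depth: int = 3,
) -> Dict[Action, float]:
    """Score each candidate first action by the mean distance-to-goal
    reduction over a beam of futures that begin with that action."""
    start = distance(state)
    scores: Dict[Action, float] = {}
    for first in actions:
        beam: List[State] = [step(state, first)]
        for _ in range(depth - 1):
            # Expand every state in the beam by every action, then keep
            # the beam_width successors closest to the goal.
            children = [step(s, a) for s in beam for a in actions]
            beam = sorted(children, key=distance)[:beam_width]
        # Aggregate several futures instead of trusting a single greedy one.
        scores[first] = sum(start - distance(s) for s in beam) / len(beam)
    return scores
```

For example, on a number line with the goal at 5, calling `score_first_actions(0, (-1, 1, 2), lambda s, a: s + a, lambda s: abs(5 - s))` ranks the first action `2` highest, since every future in its beam reaches the goal within three steps.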

Index terms

Manipulation Planning, Deep Learning in Grasping and Manipulation, Autonomous Agents