Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization
Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas
AI summary
Problem
Existing VLM-based robotic planners struggle with complex physical reasoning and long-horizon planning due to inefficient implicit value learning, reliance on single greedy futures, and high inference latency.
Approach
The method decouples state evaluation from action generation by explicitly quantifying action advantage as distance-to-goal reduction, then uses beam search to explore multiple future paths and aggregates them during decoding, triggered only when necessary by a confidence-based early exit.
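The idea of scoring an action by how much it reduces the distance to the goal can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tuple-based state, the `toy_distance` critic (a count of mismatched slots), and the function names are hypothetical and are not taken from the paper, whose critic is a learned model.

```python
from typing import Callable, Tuple

State = Tuple[int, ...]  # hypothetical symbolic state, e.g. one slot per object


def toy_distance(state: State, goal: State) -> float:
    """Stand-in critic (assumption): number of slots that differ from the goal."""
    return float(sum(s != g for s, g in zip(state, goal)))


def plan_advantage(
    state: State,
    next_state: State,
    goal: State,
    critic: Callable[[State, State], float] = toy_distance,
) -> float:
    """Advantage of an action plan = estimated distance before minus distance after."""
    return critic(state, goal) - critic(next_state, goal)


goal = (1, 1, 1)
helpful = plan_advantage((0, 0, 0), (1, 0, 0), goal)   # moves one slot into place
harmful = plan_advantage((1, 0, 0), (0, 0, 0), goal)   # undoes a correct slot
```

A positive value means the plan moves the scene closer to the goal and a negative value means it moves away, which is what makes the signal usable as direct supervision.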
Key results
- 24.6% success rate improvement on unseen tasks
- 56.5% inference time reduction via early exit
- Explicit distance-to-goal advantage enables direct supervision
- Multi-path beam search aggregation corrects initial proposals
Why it matters
Provides a scalable, efficient framework for deploying VLMs in complex robotic manipulation, bridging high-level reasoning with precise physical control.
Abstract
Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches that use reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while reducing inference time by 56.5%.
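The control flow described in the abstract (accept the direct prediction when it is confident, otherwise explore several futures and aggregate them) can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the 1-D world, the `step` and `dist` callbacks, the 0.8 threshold, the beam width and depth, and the use of a plain mean to aggregate path returns are all hypothetical choices made so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple


def beam_returns(state, goal, actions, step, dist, width=2, depth=2) -> List[float]:
    """Beam search from `state`; returns the distance-to-goal reduction of each kept path."""
    start = dist(state, goal)
    beam = [state]
    for _ in range(depth):
        # Expand every state in the beam with every action, keep the `width` closest to goal.
        expanded = [step(s, a) for s in beam for a in actions]
        expanded.sort(key=lambda s: dist(s, goal))
        beam = expanded[:width]
    return [start - dist(s, goal) for s in beam]


def choose_action(
    state,
    goal,
    probs: Dict[str, float],          # hypothetical policy confidences per action
    step: Callable,
    dist: Callable,
    threshold: float = 0.8,           # assumed early-exit threshold
    width: int = 2,
    depth: int = 2,
) -> Tuple[str, bool]:
    """Return (action, reflected). Reflection runs only if confidence is below threshold."""
    direct = max(probs, key=probs.get)
    if probs[direct] >= threshold:
        return direct, False          # early exit: trust the direct prediction
    actions = list(probs)
    scores = {}
    for a in actions:
        nxt = step(state, a)
        immediate = dist(state, goal) - dist(nxt, goal)
        future = beam_returns(nxt, goal, actions, step, dist, width, depth)
        # Aggregate the explored paths with a mean (an assumption for this sketch).
        scores[a] = immediate + sum(future) / len(future)
    return max(scores, key=scores.get), True


# Toy 1-D world: the state is a position on a line, the goal is position 5.
MOVES = {"left": -1, "right": 1}
toy_step = lambda s, a: s + MOVES[a]
toy_dist = lambda s, g: float(abs(g - s))

confident = choose_action(0, 5, {"left": 0.9, "right": 0.1}, toy_step, toy_dist)
uncertain = choose_action(0, 5, {"left": 0.55, "right": 0.45}, toy_step, toy_dist)
```

In the confident case the function returns the direct prediction without any rollouts, even though it is the wrong move in this toy world, which shows that the early exit trades accuracy for latency whenever the confidence estimate is miscalibrated. In the uncertain case the aggregated path returns override the initially preferred action.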