ICRA 2026

Do What You Say: Steering Vision-Language-Action Models Via Runtime Reasoning-Action Alignment Verification

Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, Claudia Pérez-D'Arpino

AI summary

A training-free runtime steering method aligns a robot's generated actions with its own textual plans using a VLM verifier, boosting task success and robustness without retraining.

Vision-Language-Action models, runtime steering, embodied reasoning, out-of-distribution generalization, policy verification, robotic manipulation

Problem

Reasoning Vision-Language-Action models often fail to execute their own intermediate textual plans, creating a reasoning-action faithfulness gap that degrades performance on complex or novel tasks.

Approach

The framework samples multiple candidate action sequences from the model, simulates their outcomes, and uses a pre-trained Vision-Language Model to verify and select the sequence that best matches the model's own textual plan.
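The sample, simulate, verify, and select loop described above can be sketched as a best-of-K selection. This is a minimal illustration, not the authors' implementation: `select_aligned_actions`, `sample_actions`, `simulate`, and `score_alignment` are hypothetical names, the toy stand-ins replace the real VLA, simulator, and VLM verifier, and the assumption that the verifier returns a single scalar score is ours.

```python
from itertools import cycle
from typing import Callable, List, Tuple


def select_aligned_actions(
    plan: str,
    sample_actions: Callable[[str], List[float]],
    simulate: Callable[[List[float]], float],
    score_alignment: Callable[[str, float], float],
    k: int = 10,
) -> Tuple[List[float], float]:
    """Return the candidate whose predicted outcome best matches the plan.

    All three callables are hypothetical stand-ins:
      sample_actions  -- draws one action sequence from the policy
      simulate        -- predicts the outcome of an action sequence
      score_alignment -- rates how well an outcome matches the plan (higher is better)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Draw K candidate action sequences conditioned on the same textual plan.
    candidates = [sample_actions(plan) for _ in range(k)]
    # Predict each candidate's outcome and score it against the plan.
    scores = [score_alignment(plan, simulate(c)) for c in candidates]
    # Keep the best-scoring candidate (the first one wins on ties).
    best = max(range(k), key=scores.__getitem__)
    return candidates[best], scores[best]


# --- Toy stand-ins so the sketch runs end to end (assumptions, not the paper's models) ---

def make_toy_sampler() -> Callable[[str], List[float]]:
    """Deterministic 'policy' that cycles through three fixed action sequences."""
    pool = cycle([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
    return lambda plan: next(pool)


def toy_simulate(actions: List[float]) -> float:
    """Toy 'simulator': the outcome is the final 1-D position after all moves."""
    return sum(actions)


def toy_score(plan: str, outcome: float) -> float:
    """Toy 'verifier': negative distance to the target named last in the plan."""
    target = float(plan.split()[-1])
    return -abs(outcome - target)


if __name__ == "__main__":
    actions, score = select_aligned_actions(
        "reach position 3", make_toy_sampler(), toy_simulate, toy_score, k=3
    )
    print(actions, score)
```

In this sketch the policy is never updated. Only the choice among its own samples changes, which is what makes such a scheme training-free.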

Key results

Up to 15% performance gain on out-of-distribution and compositional tasks
8% task success improvement on in-distribution scenarios without retraining
Preserves long-horizon semantic coherence through runtime verification
Open-sourced reasoning-annotated LIBERO-100 dataset and extended benchmark

Why it matters

It enables reliable, generalizable robotic control by bridging the gap between high-level reasoning and low-level execution, making reasoning-enabled VLA models more practical for real-world deployment.

Abstract

Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA’s intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA’s own textual plan. Only executing action sequences that align with the textual reasoning turns our base VLA’s natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100, environment variations tailored for OOD evaluation, and demonstrate up to 15% performance gain over prior work on behavior composition tasks. The overall framework scales with compute (347ms at K = 10 samples) and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla/

Index terms

Imitation Learning, Deep Learning Methods, Big Data in Robotics and Automation