ICRA 2026

ACG: Action Coherence Guidance for Flow-Based Vision-Language-Action Models

Minho Park, Kinam Kim, Junha Hyung, Hyojin Jang, Hoiyeong Jin, Jooyeol Yun, Hojoon Lee, Jaegul Choo

AI summary

A training-free test-time guidance method that significantly boosts success rates in robotic manipulation by steering flow-based VLA models away from temporally incoherent action distributions.

Action Coherence, Flow Matching, Vision-Language-Action Models, Test-Time Guidance, Robotic Manipulation, Inference-Time Guidance

Problem

Flow-based Vision-Language-Action models trained via imitation learning memorize noise from human demonstrations, causing unstable actions and trajectory drift that degrade performance in fine-grained manipulation.

Approach

ACG replaces the model's self-attention maps with identity matrices, so each action token attends only to itself, which yields an intentionally incoherent vector field. Sampling is then guided in the opposite direction of that field to enforce temporal consistency, without retraining.
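The idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-head attention layer, the random weights, and the names `make_params`, `velocity`, `guided_velocity`, and `sample` are all hypothetical, and the guidance rule `v + scale * (v - v_incoherent)` is an assumed classifier-free-guidance-style extrapolation, since the summary only says that sampling is pushed away from the incoherent field.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, identity_map=False):
    """Single-head self-attention over H action tokens of width D."""
    q, k, v = x @ wq, x @ wk, x @ wv
    if identity_map:
        # Identity attention map: every token attends only to itself,
        # so no information flows between time steps.
        attn = np.eye(x.shape[0])
    else:
        attn = softmax(q @ k.T / np.sqrt(x.shape[1]))
    return attn @ v

def make_params(dim, seed=0):
    """Random toy weights (hypothetical stand-in for a trained policy)."""
    rng = np.random.default_rng(seed)
    return tuple(0.1 * rng.standard_normal((dim, dim)) for _ in range(4))

def velocity(x, t, params, incoherent=False):
    """Toy vector field: one residual attention layer and a linear readout."""
    wq, wk, wv, wo = params
    h = x + self_attention(x, wq, wk, wv, identity_map=incoherent)
    return h @ wo - t * x

def guided_velocity(x, t, params, scale):
    """Assumed guidance rule: extrapolate away from the incoherent field."""
    v = velocity(x, t, params)
    v_incoherent = velocity(x, t, params, incoherent=True)
    return v + scale * (v - v_incoherent)

def sample(params, horizon, dim, scale, steps=10, seed=0):
    """Euler integration of the guided field from noise (t=0) to actions (t=1)."""
    x = np.random.default_rng(seed).standard_normal((horizon, dim))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * guided_velocity(x, i * dt, params, scale)
    return x

actions = sample(make_params(4), horizon=8, dim=4, scale=1.5)
```

In this sketch, setting `scale=0.0` recovers the unguided sampler, and each denoising step costs two forward passes (one normal, one with the identity attention map). No weights are updated, which is the sense in which the method is training-free.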

Key results

Consistently improves success rates across RoboCasa, DexMimicGen, and real-world SO-101 benchmarks
Delivers substantial gains on fine manipulation tasks (+23.1% button pressing, +11.8% insertion, +28.8% real-world pick-and-place)
Outperforms vanilla models, action smoothing, ensembling, and classifier-free guidance without additional training
Reduces trajectory drift and action instability during critical manipulation moments

Why it matters

Enables reliable, precise robotic manipulation with existing flow-based VLA policies through a simple, plug-and-play inference-time enhancement.

Abstract

Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter which reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.

Index terms

Imitation Learning, Learning from Demonstration, Machine Learning for Robot Control