ACG: Action Coherence Guidance for Flow-Based Vision-Language-Action Models
Minho Park, Kinam Kim, Junha Hyung, Hyojin Jang, Hoiyeong Jin, Jooyeol Yun, Hojoon Lee, Jaegul Choo
AI summary
Problem
Flow-based Vision-Language-Action models trained via imitation learning memorize noise from human demonstrations, causing unstable actions and trajectory drift that degrade performance in fine-grained manipulation.
Approach
ACG replaces self-attention maps with identity matrices to generate an incoherent vector field, then guides sampling in the opposite direction to enforce temporal consistency without retraining.
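The mechanism described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the single-layer `toy_vector_field`, the random weights from `make_toy_params`, the Euler sampler, and the extrapolation rule in `guided_velocity` (stepping away from the incoherent prediction by a factor `scale`) are all assumptions made for illustration. Only the two ideas stated in the summary are taken from the source: the attention map is replaced with an identity matrix to obtain an incoherent vector field, and sampling is guided in the opposite direction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v, identity_map=False):
    """Single-head self-attention over the action tokens of one chunk."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    if identity_map:
        # Identity attention map: each timestep attends only to itself,
        # so no information is shared across the action horizon.
        attn = np.eye(tokens.shape[0])
    else:
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

def make_toy_params(action_dim=4, seed=0):
    # Hypothetical random weights standing in for a trained policy.
    rng = np.random.default_rng(seed)
    attn = tuple(0.1 * rng.standard_normal((action_dim, action_dim)) for _ in range(3))
    out = 0.1 * rng.standard_normal((action_dim, action_dim))
    return {"attn": attn, "out": out}

def toy_vector_field(actions, t, params, identity_map=False):
    # Hypothetical stand-in for the flow-matching action head.
    h = self_attention(actions + t, *params["attn"], identity_map=identity_map)
    return h @ params["out"]

def guided_velocity(v_coherent, v_incoherent, scale):
    # Assumed guidance rule: extrapolate away from the incoherent field.
    return v_coherent + scale * (v_coherent - v_incoherent)

def sample_actions(params, horizon=8, action_dim=4, steps=10, scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = toy_vector_field(x, t, params)                          # normal pass
        v_bad = toy_vector_field(x, t, params, identity_map=True)   # incoherent pass
        x = x + dt * guided_velocity(v, v_bad, scale)               # Euler step
    return x
```

In this sketch, setting `scale=0` recovers unguided sampling, and each denoising step costs two forward passes of the vector field rather than one.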
Key results
- Consistently improves success rates across RoboCasa, DexMimicGen, and real-world SO-101 benchmarks
- Delivers substantial gains on fine manipulation tasks (+23.1% button pressing, +11.8% insertion, +28.8% real-world pick-and-place)
- Outperforms vanilla models, action smoothing, ensembling, and classifier-free guidance without additional training
- Reduces trajectory drift and action instability during critical manipulation moments
Why it matters
Enables reliable, precise robotic manipulation with existing flow-based VLA policies through a simple, plug-and-play inference-time enhancement.
Abstract
Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter which reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.