Toward Human Preference Optimization for Vision-Language-Action Models: A Pilot Study on the Limits of Imitation Learning
Tae-Won Lee, DongWook Kim
AI summary
Problem
Vision-Language-Action models trained via imitation learning perform well on simple tasks but degrade significantly on complex, multi-step manipulation due to distribution shift, lack of recovery behaviors, and inability to optimize for quality beyond demonstrations.
Approach
The authors evaluate a state-of-the-art VLA model across a progressive task benchmark to quantify imitation learning's limits, then propose Human Preference Optimization as a post-training pipeline that uses human trajectory rankings to train a reward model and refine policies via reinforcement learning.
Key results
- Quantified a drop in success rate from 90.0% on simple grasping (picking a can) to 4.5% on sequential multi-step tasks (placing an object in a closed drawer)
- Identified three core failure modes of behavior cloning: no recovery, compounding distribution shift, and lack of quality optimization
- Proposed a structured HPO pipeline leveraging human trajectory rankings and RL for iterative policy refinement
- Showed that average episode time grows sharply with task complexity, pointing to timeout failures in long-horizon manipulation
Why it matters
Provides a critical empirical baseline for VLA limitations and introduces a practical human-in-the-loop optimization framework to advance reliable robotic manipulation beyond demonstration data.
Abstract
Vision-Language-Action (VLA) models trained via imitation learning have achieved impressive results on robotic manipulation, yet their performance degrades significantly on complex, multi-step tasks. We evaluate NVIDIA GR00T N1.6, a state-of-the-art cross-embodiment VLA model, on the SimplerEnv benchmark to systematically identify where imitation learning falls short. Our results reveal a stark performance gap between simple single-step tasks (e.g., picking a can, 90.0%) and complex sequential tasks (e.g., placing an object in a closed drawer, 4.5%), suggesting that behavior cloning alone cannot capture the nuanced decision-making required for long-horizon manipulation. Based on these findings, we propose Human Preference Optimization (HPO) as a post-training strategy to bridge this gap, leveraging human trajectory rankings and reinforcement learning to refine VLA policies beyond what demonstration data alone can teach.
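To make the reward-modeling step of HPO more concrete, the sketch below shows one common way to turn a human ranking of trajectories into a scalar reward: expand the ranking into pairwise preferences and fit a Bradley-Terry model. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the form of the reward model, so the linear reward, the two hand-picked trajectory features, and the function names `ranking_to_pairs` and `fit_linear_reward` are all hypothetical. The subsequent reinforcement learning step is omitted.

```python
import numpy as np

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking of trajectory indices
    into (preferred, rejected) index pairs."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

def fit_linear_reward(features, pairs, lr=0.5, steps=500):
    """Fit weights w so that r(traj) = features @ w agrees with the pairs.

    Minimizes the Bradley-Terry negative log-likelihood
    -log sigmoid(r(preferred) - r(rejected)) by plain gradient descent.
    The linear reward is an assumption made for this sketch.
    """
    w = np.zeros(features.shape[1])
    win = np.array([p for p, _ in pairs])
    lose = np.array([q for _, q in pairs])
    diff = features[win] - features[lose]          # shape: (num_pairs, dim)
    for _ in range(steps):
        margin = diff @ w
        sigmoid = 1.0 / (1.0 + np.exp(-margin))
        # Gradient of -log sigmoid(margin) w.r.t. w is -(1 - sigmoid) * diff.
        grad = -((1.0 - sigmoid)[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w

# Hypothetical per-trajectory features: [task_completed, normalized_episode_time]
features = np.array([
    [1.0, 0.2],   # trajectory 0: success, fast
    [1.0, 0.8],   # trajectory 1: success, slow
    [0.0, 1.0],   # trajectory 2: timeout, no success
])
human_ranking = [0, 1, 2]   # best to worst, as an annotator might rank them

pairs = ranking_to_pairs(human_ranking)
w = fit_linear_reward(features, pairs)
rewards = features @ w      # learned scalar reward per trajectory
```

In this toy setup the learned rewards preserve the annotator's ordering, so a policy optimized against them would be pushed toward successful and faster episodes. A real reward model for a VLA policy would presumably operate on observations and actions rather than two summary features.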