ICRA 2026

Toward Human Preference Optimization for Vision-Language-Action Models: A Pilot Study on the Limits of Imitation Learning

Tae-Won Lee, DongWook Kim

AI summary

Imitation learning success drops from 90% on simple grasping to 4.5% on sequential multi-step robotic tasks; the authors propose Human Preference Optimization as a targeted post-training strategy to bridge this gap.

Vision-Language-Action Models, Imitation Learning, Human Preference Optimization, Reinforcement Learning, Robotic Manipulation

Problem

Vision-Language-Action models trained via imitation learning perform well on simple tasks but degrade significantly on complex, multi-step manipulation due to distribution shift, lack of recovery behaviors, and inability to optimize for quality beyond demonstrations.

Approach

The authors evaluate a state-of-the-art VLA model across a progressive task benchmark to quantify imitation learning's limits, then propose Human Preference Optimization as a post-training pipeline that uses human trajectory rankings to train a reward model and refine policies via reinforcement learning.
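
The reward-model step of such a pipeline can be illustrated with a minimal sketch. The summary does not state how the reward model is trained, so this example assumes a Bradley-Terry pairwise loss, a common choice for learning from rankings, and a linear reward over hand-picked trajectory features. The feature names, the toy data, and all function names are hypothetical, and the subsequent RL refinement step is omitted.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_ids, 2))


def reward(weights, features):
    """Linear reward model: dot product of weights and trajectory features."""
    return sum(w * x for w, x in zip(weights, features))


def bradley_terry_loss(weights, features_by_id, pairs):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs."""
    total = 0.0
    for win, lose in pairs:
        margin = reward(weights, features_by_id[win]) - reward(weights, features_by_id[lose])
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def fit_reward_model(features_by_id, pairs, lr=0.5, steps=200):
    """Fit the linear reward weights by plain gradient descent."""
    dim = len(next(iter(features_by_id.values())))
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for win, lose in pairs:
            fw, fl = features_by_id[win], features_by_id[lose]
            margin = reward(weights, fw) - reward(weights, fl)
            # Gradient of -log(sigmoid(margin)) is -(1 - sigmoid(margin)) * (fw - fl).
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                grad[i] += coeff * (fw[i] - fl[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy data: [task_completed, normalized_episode_time] per trajectory.
features_by_id = {"a": [1.0, 0.2], "b": [1.0, 0.8], "c": [0.0, 1.0]}
human_ranking = ["a", "b", "c"]  # assumed annotator ranking, best first

pairs = ranking_to_pairs(human_ranking)
weights = fit_reward_model(features_by_id, pairs)
for traj_id in human_ranking:
    print(traj_id, round(reward(weights, features_by_id[traj_id]), 3))
```

In a full pipeline the learned reward would score policy rollouts during RL fine-tuning; a practical reward model would likely operate on learned trajectory embeddings rather than two hand-written features.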

Key results

Quantified a success rate drop from 90% on simple grasping to 4.5% on sequential multi-step tasks
Identified three core failure modes of behavior cloning: no recovery, compounding distribution shift, and lack of quality optimization
Proposed a structured HPO pipeline leveraging human trajectory rankings and RL for iterative policy refinement
Demonstrated that average episode time increases sharply with task complexity, pointing to timeout failures in long-horizon manipulation

Why it matters

Provides a critical empirical baseline for VLA limitations and introduces a practical human-in-the-loop optimization framework to advance reliable robotic manipulation beyond demonstration data.

Abstract

Vision-Language-Action (VLA) models trained via imitation learning have achieved impressive results on robotic manipulation, yet their performance degrades significantly on complex, multi-step tasks. We evaluate NVIDIA GR00T N1.6, a state-of-the-art cross-embodiment VLA model, on the SimplerEnv benchmark to systematically identify where imitation learning falls short. Our results reveal a stark performance gap between simple single-step tasks (e.g., picking a can, 90.0%) and complex sequential tasks (e.g., placing an object in a closed drawer, 4.5%), suggesting that behavior cloning alone cannot capture the nuanced decision-making required for long-horizon manipulation. Based on these findings, we propose Human Preference Optimization (HPO) as a post-training strategy to bridge this gap, leveraging human trajectory rankings and reinforcement learning to refine VLA policies beyond what demonstration data alone can teach.

Index terms

Imitation Learning, Reinforcement Learning, Deep Learning in Grasping and Manipulation