ICRA 2026

Toward Human Preference Optimization for Vision-Language-Action Models: A Pilot Study on the Limits of Imitation Learning

Tae-Won Lee, DongWook Kim

AI summary

Imitation learning fails dramatically on complex multi-step robotic tasks; the authors propose Human Preference Optimization as a targeted post-training strategy to bridge this gap.
Vision-Language-Action Models, Imitation Learning, Human Preference Optimization, Reinforcement Learning, Robotic Manipulation

Problem

Vision-Language-Action models trained via imitation learning perform well on simple tasks but degrade significantly on complex, multi-step manipulation due to distribution shift, lack of recovery behaviors, and inability to optimize for quality beyond demonstrations.

Approach

The authors evaluate a state-of-the-art VLA model across a progressive task benchmark to quantify imitation learning's limits, then propose Human Preference Optimization as a post-training pipeline that uses human trajectory rankings to train a reward model and refine policies via reinforcement learning.
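
The reward-model step of such a pipeline can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a linear reward over hand-picked trajectory features and a Bradley-Terry pairwise loss, neither of which is specified in the summary. The function names, the feature layout, and the trajectory ids are all hypothetical.

```python
import math

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking of trajectory ids into (preferred, rejected) pairs."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

def reward(weights, features):
    """Linear reward: dot product of weights and trajectory features (an assumption)."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(features, pairs, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on the Bradley-Terry loss
    -log sigmoid(r(preferred) - r(rejected))."""
    dim = len(next(iter(features.values())))
    weights = [0.0] * dim
    for _ in range(epochs):
        for win, lose in pairs:
            diff = reward(weights, features[win]) - reward(weights, features[lose])
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(diff)) * (f_win - f_lose)
            scale = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            for k in range(dim):
                weights[k] += lr * scale * (features[win][k] - features[lose][k])
    return weights

# Hypothetical features per trajectory: [task progress, fraction of time budget used]
features = {"a": [1.0, 0.2], "b": [0.5, 0.6], "c": [0.0, 1.0]}
human_ranking = ["a", "b", "c"]  # best to worst, as a human annotator might rank them

weights = train_reward_model(features, ranking_to_pairs(human_ranking))
scores = {name: reward(weights, f) for name, f in features.items()}
```

In a full pipeline the learned reward would then score policy rollouts during reinforcement learning; that stage is omitted here because the summary does not say which RL algorithm is used.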

Key results

  • Quantified a success rate drop from 90% on simple grasping to 4.5% on sequential multi-step tasks
  • Identified three core failure modes of behavior cloning: no recovery, compounding distribution shift, and lack of quality optimization
  • Proposed a structured HPO pipeline leveraging human trajectory rankings and RL for iterative policy refinement
  • Demonstrated that average episode time scales drastically with task complexity, highlighting timeout failures in long-horizon manipulation
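
To give a feel for the size of the reported gap, the arithmetic below uses a deliberately simplified model that is not from the paper: it treats a multi-step task as a chain of independent sub-steps that all succeed with the same probability. Real compounding distribution shift makes later steps harder than earlier ones, so this is only a rough illustration. The step counts are hypothetical; only the 90% and 4.5% figures come from the summary.

```python
def chain_success(per_step, n_steps):
    """Overall success if every one of n_steps independent sub-steps must succeed."""
    return per_step ** n_steps

def implied_per_step(overall, n_steps):
    """Per-step success rate that would produce the given overall rate over n_steps."""
    return overall ** (1.0 / n_steps)

simple_rate = 0.90     # simple grasping, from the summary
complex_rate = 0.045   # sequential multi-step task, from the summary

# Absolute gap in percentage points (90.0 - 4.5)
gap_points = round((simple_rate - complex_rate) * 100, 1)

# Hypothetical step counts: what per-step reliability would explain 4.5% overall?
implied = {n: implied_per_step(complex_rate, n) for n in (2, 3, 4, 5)}
```

Under this toy model, the fewer sub-steps a task is assumed to have, the lower the per-step reliability needed to explain the 4.5% result, which is one way to read the claim that behavior cloning degrades faster than simple error accumulation would suggest.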

Why it matters

Provides a critical empirical baseline for VLA limitations and introduces a practical human-in-the-loop optimization framework to advance reliable robotic manipulation beyond demonstration data.

Abstract

Vision-Language-Action (VLA) models trained via imitation learning have achieved impressive results on robotic manipulation, yet their performance degrades significantly on complex, multi-step tasks. We evaluate NVIDIA GR00T N1.6, a state-of-the-art cross-embodiment VLA model, on the SimplerEnv benchmark to systematically identify where imitation learning falls short. Our results reveal a stark performance gap between simple single-step tasks (e.g., picking a can, 90.0%) and complex sequential tasks (e.g., placing an object in a closed drawer, 4.5%), suggesting that behavior cloning alone cannot capture the nuanced decision-making required for long-horizon manipulation. Based on these findings, we propose Human Preference Optimization (HPO) as a post-training strategy to bridge this gap, leveraging human trajectory rankings and reinforcement learning to refine VLA policies beyond what demonstration data alone can teach.

Index terms

Imitation Learning, Reinforcement Learning, Deep Learning in Grasping and Manipulation