ICRA 2026

GRAPE: Generalizing Robot Policy Via Preference Alignment

Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao

AI summary

GRAPE aligns vision-language-action models to diverse objectives like safety and efficiency by optimizing preferences over full trajectories, significantly boosting generalization to unseen tasks.

Vision-Language-Action, Preference Alignment, Robot Policy, Trajectory Optimization, Task Generalization, Autonomous Manipulation

Problem

Current vision-language-action models rely on behavior cloning from successful expert rollouts, causing poor generalization to new tasks, distribution bias, and an inability to adapt to customized objectives like safety or efficiency.

Approach

GRAPE decomposes complex manipulation tasks into temporal stages and uses a vision-language model to propose keypoints for automatic cost function generation. It then aligns the policy via trajectory-wise preference optimization, ranking trajectories based on flexible, objective-specific spatial-temporal constraints.
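
The ranking step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the cost terms, the aggregation by summation, and every name here (`stage_cost`, `trajectory_cost`, `preference_pair`, `safe_radius`, `collision_weight`) are hypothetical, and the keypoints are hard-coded 2D points rather than proposals from a vision-language model. It assumes each trajectory is already split into stages, with one target keypoint per stage and a single obstacle keypoint.

```python
import math

def stage_cost(path, target, obstacle, safe_radius=0.1, collision_weight=1.0):
    """Hypothetical cost for one stage (lower is better).

    path: list of end-effector positions visited during the stage.
    target: keypoint the stage should end at.
    obstacle: keypoint the end effector should stay away from.
    """
    # Task-completion term: how far the stage ends from its target keypoint.
    reach_error = math.dist(path[-1], target)
    # Safety term: count waypoints that come within safe_radius of the obstacle.
    collisions = sum(1 for p in path if math.dist(p, obstacle) < safe_radius)
    return reach_error + collision_weight * collisions

def trajectory_cost(stages, targets, obstacle, **kwargs):
    """Sum the stage costs (assumed aggregation) to score a whole trajectory."""
    return sum(stage_cost(path, target, obstacle, **kwargs)
               for path, target in zip(stages, targets))

def preference_pair(trajectories, targets, obstacle, **kwargs):
    """Rank trajectories by cost and return (preferred, rejected)."""
    ranked = sorted(trajectories,
                    key=lambda stages: trajectory_cost(stages, targets, obstacle, **kwargs))
    return ranked[0], ranked[-1]

# Two-stage toy task: reach (1, 0), then reach (1, 1), avoiding (0.5, 0).
targets = [(1.0, 0.0), (1.0, 1.0)]
obstacle = (0.5, 0.0)
safe = [[(0.0, 0.0), (0.5, 0.5), (1.0, 0.0)], [(1.0, 0.0), (1.0, 1.0)]]
risky = [[(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 1.0)]]
preferred, rejected = preference_pair([risky, safe], targets, obstacle)
```

Changing `collision_weight`, or adding a term that penalizes the number of waypoints, is one way such a cost could be steered toward safety or efficiency; how GRAPE weights these objectives is not specified in this summary.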

Key results

Increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively
Reduces collision rates by 37.44% and rollout step length by 11.15% when aligned to safety and efficiency objectives, respectively
Surpasses SFT and step-wise DPO baselines in simulation and real-world tests
Achieves strong cross-domain generalization across visual, subject, and semantic shifts

Why it matters

Provides a scalable, objective-flexible framework for training robust robotic policies without expensive online reinforcement learning or manual reward design.

Abstract

Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step length by 11.15%, respectively.
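
As a rough illustration of what aligning on a trajectory level with an implicit reward could look like, the sketch below applies a DPO-style loss to whole trajectories in Python. The abstract does not give the objective, so the exact form is an assumption: the trajectory log-probability is taken as the sum of per-step action log-probabilities, and the function names and the `beta` value are hypothetical.

```python
import math

def trajectory_logprob(step_logprobs):
    """Log-probability of a trajectory: sum of its per-step action log-probs."""
    return sum(step_logprobs)

def trajectory_dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """DPO-style loss for one (preferred, rejected) trajectory pair.

    Each argument is a list of per-step action log-probs, under the policy
    being trained (policy_*) or a frozen reference policy (ref_*).
    This is an assumed form, not the paper's stated objective.
    """
    # Implicit reward of a trajectory: beta * (policy log-prob - reference log-prob).
    reward_win = beta * (trajectory_logprob(policy_win) - trajectory_logprob(ref_win))
    reward_lose = beta * (trajectory_logprob(policy_lose) - trajectory_logprob(ref_lose))
    margin = reward_win - reward_lose
    # -log(sigmoid(margin)), written in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
same = [-1.0, -2.0]
baseline_loss = trajectory_dpo_loss(same, same, same, same)
```

The loss falls below log(2) as the policy assigns relatively more probability than the reference to the preferred trajectory and less to the rejected one, which is how a failed rollout can contribute a training signal without an explicit reward model.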

Index terms

Machine Learning for Robot Control, Reinforcement Learning, Learning from Demonstration