A Taxonomy for Evaluating Generalist Robot Manipulation Policies
Jensen Gao, Suneel Belkhale, Sudeep Dasari, Ashwin Balakrishna, Dhruv Shah, Dorsa Sadigh
AI summary
Problem
Evaluation of generalization in robot manipulation is inconsistent: each study proposes and measures different types of generalization in its own, often hard-to-reproduce settings, which obscures progress toward real-world deployability.
Approach
The authors introduce ⋆-Gen, a systematic taxonomy categorizing generalization into visual, semantic, and behavioral axes based on policy perturbations, and validate it through two real-world benchmarks.
Key results
- Introduction of ⋆-Gen taxonomy structuring generalization into visual, semantic, and behavioral axes
- Development of BridgeV2-⋆ benchmark evaluating 13 generalization axes across state-of-the-art VLA models
- Discovery that open-source vision-language-action models exhibit weak semantic generalization despite internet-scale pre-training
- Validation of ⋆-Gen on dexterous, long-horizon bimanual manipulation using the ALOHA 2 platform
Why it matters
⋆-Gen provides a standardized, reproducible framework for benchmarking generalist robot manipulation policies, which is useful for researchers and developers aiming for real-world robotic deployment.
Abstract
Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies towards generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult to reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose ⋆-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate ⋆-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform that covers more dexterous and longer horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets.
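As an illustration of how evaluation results might be organized along the taxonomy's three top-level categories, the following is a minimal sketch, not the authors' evaluation code. The `Trial` record, the `success_rate_by_category` helper, and the axis labels (`axis_a`, `axis_b`) are hypothetical names introduced only for this example; the only detail taken from the text is the visual, semantic, and behavioral grouping.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three top-level categories named in the abstract.
CATEGORIES = ("visual", "semantic", "behavioral")


@dataclass(frozen=True)
class Trial:
    """One evaluation rollout (hypothetical record format)."""
    category: str  # one of CATEGORIES
    axis: str      # placeholder label for a finer-grained axis
    success: bool  # whether the policy completed the task


def success_rate_by_category(trials):
    """Return the fraction of successful trials for each category present."""
    counts = defaultdict(lambda: [0, 0])  # category -> [successes, total]
    for trial in trials:
        if trial.category not in CATEGORIES:
            raise ValueError(f"unknown category: {trial.category!r}")
        counts[trial.category][0] += int(trial.success)
        counts[trial.category][1] += 1
    return {cat: wins / total for cat, (wins, total) in counts.items()}


# Made-up trials, for demonstration only (not results from the paper).
example_trials = [
    Trial("visual", "axis_a", True),
    Trial("visual", "axis_a", False),
    Trial("semantic", "axis_b", False),
]
rates = success_rate_by_category(example_trials)
```

Reporting one rate per category, rather than a single aggregate score, is what would make a category-specific weakness visible, such as the semantic one the abstract describes.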