ICRA 2026

A Taxonomy for Evaluating Generalist Robot Manipulation Policies

Jensen Gao, Suneel Belkhale, Sudeep Dasari, Ashwin Balakrishna, Dhruv Shah, Dorsa Sadigh

AI summary

Open-source vision-language-action robot policies struggle with semantic generalization despite large-scale pre-training, underscoring the need for structured evaluation frameworks.

Robot manipulation, Generalization taxonomy, Vision-language-action models, Benchmark evaluation, Robotic policies, Semantic generalization

Problem

Evaluating generalization in robot manipulation lacks consistency, with each study proposing different, often unreproducible metrics that obscure progress toward real-world deployability.

Approach

The authors introduce ⋆-Gen, a systematic taxonomy categorizing generalization into visual, semantic, and behavioral axes based on policy perturbations, and validate it through two real-world benchmarks.
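As a rough illustration of how results might be reported along the three top-level categories, the sketch below groups evaluation trials by category and computes a success rate for each. The names `GenCategory` and `success_rate_by_category`, the trial record format, and the sample outcomes are all hypothetical; none of them come from the paper, and the paper's finer-grained axes are not modeled here.

```python
from collections import defaultdict
from enum import Enum


class GenCategory(Enum):
    """The three top-level categories named in the summary (hypothetical encoding)."""
    VISUAL = "visual"
    SEMANTIC = "semantic"
    BEHAVIORAL = "behavioral"


def success_rate_by_category(trials):
    """Return {category: fraction of successful trials}.

    `trials` is an iterable of (GenCategory, bool) pairs. This record
    format is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [successes, total]
    for category, succeeded in trials:
        counts[category][0] += int(succeeded)
        counts[category][1] += 1
    return {cat: wins / total for cat, (wins, total) in counts.items()}


# Made-up outcomes, not results from the paper.
example_trials = [
    (GenCategory.VISUAL, True),
    (GenCategory.VISUAL, True),
    (GenCategory.SEMANTIC, False),
    (GenCategory.SEMANTIC, True),
    (GenCategory.BEHAVIORAL, False),
]

rates = success_rate_by_category(example_trials)
for category, rate in rates.items():
    print(f"{category.value}: {rate:.2f}")
```

Reporting one number per category, rather than a single aggregate, is what lets a weakness on one kind of generalization show up instead of being averaged away.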

Key results

Introduction of ⋆-Gen taxonomy structuring generalization into visual, semantic, and behavioral axes
Development of BridgeV2-⋆ benchmark evaluating 13 generalization axes across state-of-the-art VLA models
Discovery that open-source vision-language-action models exhibit weak semantic generalization despite internet-scale pre-training
Validation of ⋆-Gen on dexterous, long-horizon bimanual manipulation using the ALOHA 2 platform

Why it matters

Provides a standardized, reproducible framework for benchmarking and advancing generalist robot manipulation policies, crucial for researchers and developers aiming for real-world robotic deployment.

Abstract

Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies towards generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult-to-reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose ⋆-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate ⋆-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform that covers more dexterous and longer-horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets.

Index terms

Big Data in Robotics and Automation; Deep Learning in Grasping and Manipulation; Imitation Learning