RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
Yi Ru Wang, Carter Ung, Christopher Tan, Grant Tannert, Jiafei Duan, Josephine Li, Anh Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa
AI summary
Problem
Existing robotic manipulation benchmarks rely on binary success rates that mask execution quality, obscure failure structures, and fail to distinguish between robust and brittle policies.
Approach
The authors introduce ROBOEVAL, a modular simulation benchmark featuring eight bimanual tasks, systematic variations, and over 3,000 expert demonstrations, instrumented with standardized behavioral and outcome metrics for fine-grained policy analysis.
Key results
- Benchmark of 8 bimanual tasks with 3,000+ expert demonstrations
- Behavioral metrics distinguish policies with identical success rates
- Outcome metrics reveal structured stage-wise failure modes
- Metrics remain stable across task variations and correlate with task success
Why it matters
Enables robotics researchers to diagnose policy limitations and drive progress beyond simplistic success-rate tracking.
Abstract
We introduce ROBOEVAL, a structured evaluation framework and benchmark for robotic manipulation that augments binary success with principled behavioral and outcome metrics. Existing evaluations often collapse performance into outcome counts, masking differences in execution quality and obscuring failure structure. ROBOEVAL provides eight bimanual tasks with systematically controlled variations, more than three thousand expert demonstrations, and a modular simulation platform for reproducible experimentation. All tasks are instrumented with standardized metrics that quantify efficiency, coordination, and safety/stability, as well as outcome measures that trace stagewise progress and localize failure modes. Through extensive experiments with state-of-the-art visuomotor policies, we validate these metrics by analyzing their stability under variation, discriminative power across policies with similar success rates, and correlation with task success. Project Page: https://robo-eval.github.io
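To illustrate why a binary success rate alone can hide differences between policies, the following minimal Python sketch compares two hypothetical policies with identical success rates. It is not the ROBOEVAL API: the `Episode` record, its fields, the `summarize` helper, and all numbers are assumptions invented for illustration, with stage progress and path length standing in loosely for the outcome and efficiency metrics the abstract describes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-rollout record (not a ROBOEVAL data structure)."""
    stages_completed: int  # task stages finished before the rollout ended
    total_stages: int      # stages the task defines
    path_length: float     # assumed efficiency proxy: end-effector travel in meters

    @property
    def success(self) -> bool:
        # Binary success: every stage was completed.
        return self.stages_completed == self.total_stages


def summarize(episodes):
    """Aggregate binary success alongside finer-grained illustrative metrics."""
    return {
        "success_rate": mean(1.0 if e.success else 0.0 for e in episodes),
        "mean_stage_progress": mean(
            e.stages_completed / e.total_stages for e in episodes
        ),
        "mean_path_length": mean(e.path_length for e in episodes),
    }


# Made-up rollouts: both policies succeed in 2 of 4 episodes.
policy_a = [Episode(4, 4, 1.2), Episode(3, 4, 1.5), Episode(4, 4, 1.1), Episode(3, 4, 1.4)]
policy_b = [Episode(4, 4, 2.6), Episode(0, 4, 0.3), Episode(4, 4, 2.8), Episode(1, 4, 0.9)]

summary_a = summarize(policy_a)
summary_b = summarize(policy_b)
print(summary_a)
print(summary_b)
```

In this toy data both policies have the same success rate, yet the first one fails late (after most stages are done) while the second fails early, which is the kind of distinction that stagewise outcome measures are meant to surface.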