ICRA 2026

RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

Yi Ru Wang, Carter Ung, Christopher Tan, Grant Tannert, Jiafei Duan, Josephine Li, Anh Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa

AI summary

Key figure (auto-extracted from paper)

Binary success rates mask critical differences in policy execution, making multi-dimensional behavioral and outcome metrics essential for accurate evaluation.

Robotic Manipulation; Evaluation Benchmark; Behavioral Metrics; Bimanual Control; Policy Diagnosis; Simulation

Problem

Existing robotic manipulation benchmarks rely on binary success rates that mask execution quality, obscure failure structures, and fail to distinguish between robust and brittle policies.

Approach

The authors introduce ROBOEVAL, a modular simulation benchmark featuring eight bimanual tasks, systematic variations, and over 3,000 expert demonstrations, instrumented with standardized behavioral and outcome metrics for fine-grained policy analysis.

Key results

Benchmark of 8 bimanual tasks with 3,000+ expert demonstrations
Behavioral metrics distinguish policies with identical success rates
Outcome metrics reveal structured stage-wise failure modes
Metrics show stability across variations and correlate with success
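
To make the first two results concrete, here is a minimal Python sketch of how two policies with the same binary success rate can be separated by a stage-wise outcome metric and a simple behavioral metric. It is illustrative only and is not the ROBOEVAL API: the `Rollout` record, the function names, the choice of end-effector path length as an efficiency proxy, and the toy rollouts are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Rollout:
    """One evaluation episode (hypothetical record, not a ROBOEVAL type)."""
    stages_completed: int      # task stages reached before the episode ended
    total_stages: int          # stages required for full task success
    ee_positions: List[Point]  # end-effector positions over time


def is_success(r: Rollout) -> bool:
    # Binary success: every stage of the task was completed.
    return r.stages_completed == r.total_stages


def stage_progress(r: Rollout) -> float:
    # Outcome metric: fraction of stages completed, which shows how far
    # a failed rollout got before it stopped making progress.
    return r.stages_completed / r.total_stages


def path_length(r: Rollout) -> float:
    # Behavioral metric (assumed efficiency proxy): total distance
    # travelled by the end effector.
    pts = r.ee_positions
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def summarize(rollouts: List[Rollout]) -> Dict[str, float]:
    # Average each metric over a non-empty list of rollouts.
    n = len(rollouts)
    return {
        "success_rate": sum(is_success(r) for r in rollouts) / n,
        "mean_stage_progress": sum(stage_progress(r) for r in rollouts) / n,
        "mean_path_length": sum(path_length(r) for r in rollouts) / n,
    }


# Toy data: both policies succeed in one of two rollouts.
policy_a = [
    Rollout(4, 4, [(0, 0, 0), (1, 0, 0)]),    # direct, successful
    Rollout(2, 4, [(0, 0, 0), (0.5, 0, 0)]),  # fails halfway through
]
policy_b = [
    Rollout(4, 4, [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)]),  # detour
    Rollout(0, 4, [(0, 0, 0)]),               # fails before the first stage
]

summary_a = summarize(policy_a)
summary_b = summarize(policy_b)
print(summary_a)
print(summary_b)
```

In this toy data both policies report a 50% success rate, yet policy A gets further on its failed rollout and takes a shorter path on its successful one. That is the kind of difference a single success count hides.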

Why it matters

Enables robotics researchers to diagnose policy limitations and drive progress beyond simplistic success-rate tracking.

Abstract

We introduce ROBOEVAL, a structured evaluation framework and benchmark for robotic manipulation that augments binary success with principled behavioral and outcome metrics. Existing evaluations often collapse performance into outcome counts, masking differences in execution quality and obscuring failure structure. ROBOEVAL provides eight bimanual tasks with systematically controlled variations, more than three thousand expert demonstrations, and a modular simulation platform for reproducible experimentation. All tasks are instrumented with standardized metrics that quantify efficiency, coordination, and safety/stability, as well as outcome measures that trace stagewise progress and localize failure modes. Through extensive experiments with state-of-the-art visuomotor policies, we validate these metrics by analyzing their stability under variation, discriminative power across policies with similar success rates, and correlation with task success. Project Page: https://robo-eval.github.io

Index terms

Performance Evaluation and Benchmarking; Bimanual Manipulation; Imitation Learning