ICRA 2026

Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators

Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar

AI summary

Combining a small number of real-world trials with large-scale simulation via prediction-powered inference reduces hardware evaluation costs by 20–25% while maintaining statistically valid confidence intervals for robot policy performance.

Robot policy evaluation, prediction-powered inference, simulation-to-real gap, confidence intervals, scalable evaluation, robotics benchmarks

Problem

Real-world robot policy evaluation is resource-intensive and often lacks statistical rigor, while relying solely on simulation introduces bias due to the simulation-to-real gap, preventing trustworthy inferences about real-world performance.

Approach

SureSim pairs a small set of real-world evaluations with large-scale simulation trials to estimate and correct simulation bias, then applies non-asymptotic mean estimation algorithms to generate finite-sample valid confidence intervals for real-world policy performance.
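The bias-correction idea above can be illustrated with a minimal sketch of a prediction-powered mean estimate. It is not the SureSim implementation. The function name `ppi_mean_interval` and its arguments are hypothetical, outcomes are assumed to lie in [0, 1] (for example, binary success), and the extra simulation trials are assumed to be independent of the paired trials. Hoeffding's inequality with a union bound stands in for the non-asymptotic mean estimation algorithms, which this summary does not specify. Hoeffding ignores variance, so this sketch yields a valid but loose interval and would not by itself reproduce the reported savings.

```python
import math


def ppi_mean_interval(real_paired, sim_paired, sim_extra, delta=0.1):
    """Hypothetical helper: estimate mean real-world performance.

    real_paired / sim_paired: outcomes in [0, 1] for the same n test conditions.
    sim_extra: outcomes in [0, 1] for N additional simulation-only conditions.
    Returns (estimate, lower, upper) with coverage at least 1 - delta.
    """
    n = len(real_paired)
    big_n = len(sim_extra)
    if n == 0 or n != len(sim_paired) or big_n == 0:
        raise ValueError("need n > 0 paired trials and N > 0 simulation trials")

    # Rectifier: average real-minus-sim gap on the paired trials.
    rectifier = sum(r - s for r, s in zip(real_paired, sim_paired)) / n
    # Large-scale simulation mean, corrected by the rectifier.
    sim_mean = sum(sim_extra) / big_n
    estimate = sim_mean + rectifier

    # Hoeffding half-widths, each at level delta / 2 (union bound).
    # Simulation outcomes have range 1; paired differences have range 2.
    log_term = math.log(4.0 / delta)
    half_sim = math.sqrt(log_term / (2 * big_n))
    half_rect = 2.0 * math.sqrt(log_term / (2 * n))
    half_width = half_sim + half_rect

    # The true mean lies in [0, 1], so clip the interval to that range.
    lower = max(0.0, estimate - half_width)
    upper = min(1.0, estimate + half_width)
    return estimate, lower, upper
```

Calling `ppi_mean_interval(real, sim, sim_extra, delta=0.05)` returns the corrected point estimate and a 95% interval. A variance-adaptive bound in place of Hoeffding would be needed for the rectifier term to benefit from strong real-simulation agreement.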

Key results

Introduces SureSim, a framework for finite-sample valid confidence intervals on real-world policy performance using paired real and simulation data.
Demonstrates a 20–25% reduction in required real-world hardware trials while achieving comparable statistical bounds.
Validates the approach on both a diffusion policy and a multi-task fine-tuned π0 foundation model across diverse object and environment distributions.
Analyzes method sensitivity to varying real-simulation correlation, identifying conditions where simulation augmentation provides optimal benefits.
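The role of real-simulation correlation in the last result can be seen from a back-of-the-envelope variance calculation for the basic prediction-powered mean estimator. This is a sketch under stated assumptions, not the paper's analysis: the helper `ppi_variance_ratio` is hypothetical, the paired and simulation-only trials are assumed independent, and `rho`, `sigma_real`, and `sigma_sim` are assumed to be known population quantities.

```python
def ppi_variance_ratio(rho, sigma_real, sigma_sim, n, big_n):
    """Hypothetical helper: variance of the prediction-powered mean estimate
    divided by the variance of the real-only sample mean.

    rho: correlation between paired real and simulation outcomes.
    sigma_real, sigma_sim: standard deviations of real and simulation outcomes.
    n: number of paired real/simulation trials.
    big_n: number of additional simulation-only trials.
    A ratio below 1 means simulation augmentation reduces variance.
    """
    # Real-only baseline: sample mean of n real trials.
    var_real_only = sigma_real ** 2 / n
    # Variance of a single paired difference (real minus simulation).
    var_diff = sigma_real ** 2 + sigma_sim ** 2 - 2 * rho * sigma_real * sigma_sim
    # Simulation mean over big_n trials plus rectifier over n paired trials.
    var_ppi = sigma_sim ** 2 / big_n + var_diff / n
    return var_ppi / var_real_only


# Illustrative inputs: equal spreads, 10 paired trials, 1000 simulation trials.
high_corr = ppi_variance_ratio(1.0, 0.5, 0.5, 10, 1000)
no_corr = ppi_variance_ratio(0.0, 0.5, 0.5, 10, 1000)
```

With equal standard deviations, perfectly correlated outcomes give a ratio of n/N, while uncorrelated outcomes give a ratio above 2, meaning the simulation data then adds noise rather than removing it.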

Why it matters

Provides roboticists and researchers with a statistically rigorous, cost-effective evaluation protocol that bridges the simulation-to-real gap without sacrificing reliability.

Abstract

Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned π0 on a joint distribution of objects and initial conditions, and find that our approach saves over 20–25% of hardware evaluation effort to achieve similar bounds on policy performance. Supplementary notes and videos can be found at https://suresim-robot-eval.github.io.

Index terms

Probability and Statistical Methods; Performance Evaluation and Benchmarking