Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar
AI summary
Problem
Real-world robot policy evaluation is resource-intensive and often lacks statistical rigor, while relying solely on simulation introduces bias due to the simulation-to-real gap, preventing trustworthy inferences about real-world performance.
Approach
SureSim pairs a small set of real-world evaluations with large-scale simulation trials to estimate and correct simulation bias, then applies non-asymptotic mean estimation algorithms to generate finite-sample valid confidence intervals for real-world policy performance.
Key results
- Introduces SureSim, a framework for finite-sample valid confidence intervals on real-world policy performance using paired real and simulation data.
- Demonstrates a 20–25% reduction in required real-world hardware trials while achieving comparable statistical bounds.
- Validates the approach on both a diffusion policy and a multi-task fine-tuned π0 foundation model across diverse object and environment distributions.
- Analyzes method sensitivity to varying real-simulation correlation, identifying conditions where simulation augmentation provides optimal benefits.
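A rough variance comparison makes the role of real-simulation correlation concrete. This is a sketch under assumptions not stated in the summary: Y denotes the real outcome and F the simulation outcome on the same test condition, the n paired trials and the N additional simulation-only trials are drawn i.i.d. from the same distribution and independently of each other, and the notation is illustrative rather than the paper's.

\[
\hat{\theta} \;=\; \frac{1}{N}\sum_{i=1}^{N} \tilde{F}_i \;+\; \frac{1}{n}\sum_{j=1}^{n} \left(Y_j - F_j\right),
\qquad
\mathbb{E}[\hat{\theta}] \;=\; \mathbb{E}[F] + \mathbb{E}[Y] - \mathbb{E}[F] \;=\; \mathbb{E}[Y]
\]

\[
\operatorname{Var}(\hat{\theta}) \;=\; \frac{\operatorname{Var}(F)}{N} + \frac{\operatorname{Var}(Y - F)}{n},
\qquad
\operatorname{Var}(Y - F) \;=\; \operatorname{Var}(Y) + \operatorname{Var}(F) - 2\operatorname{Cov}(Y, F)
\]

Against the real-only variance \(\operatorname{Var}(Y)/n\), and in the limit of large N, the augmented estimator has lower variance when \(2\operatorname{Cov}(Y, F) > \operatorname{Var}(F)\), that is, when the correlation satisfies \(\rho > \sigma_F / (2\sigma_Y)\). Below that threshold, the simulation term adds noise rather than removing it.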
Why it matters
Provides roboticists and researchers with a statistically rigorous, cost-effective evaluation protocol that bridges the simulation-to-real gap without sacrificing reliability.
Abstract
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned π0 on a joint distribution of objects and initial conditions, and find that our approach saves over 20–25% of hardware evaluation effort to achieve similar bounds on policy performance. Supplementary notes and videos can be found at https://suresim-robot-eval.github.io.
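The bias-rectification idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the SureSim implementation: the function names are hypothetical, outcomes are assumed to be success scores in [0, 1], and all trials are assumed i.i.d. The interval uses a plain Hoeffding bound with a union bound as a simple stand-in for the non-asymptotic mean estimation algorithms the abstract refers to.

```python
import math


def ppi_mean_estimate(real_paired, sim_paired, sim_extra):
    """Prediction-powered point estimate of mean real-world performance.

    real_paired[j] and sim_paired[j] are outcomes in [0, 1] for the SAME
    test condition; sim_extra holds outcomes from simulation-only trials.
    (Hypothetical helper, not taken from the paper.)
    """
    if len(real_paired) != len(sim_paired):
        raise ValueError("paired real and simulation lists must match in length")
    if not real_paired or not sim_extra:
        raise ValueError("need at least one paired and one simulation-only trial")
    sim_mean = sum(sim_extra) / len(sim_extra)
    # Rectifier: average real-minus-sim gap measured on the paired trials.
    rectifier = sum(y - f for y, f in zip(real_paired, sim_paired)) / len(real_paired)
    return sim_mean + rectifier


def hoeffding_ppi_interval(real_paired, sim_paired, sim_extra, delta=0.1):
    """Finite-sample interval covering the real mean with prob. >= 1 - delta.

    Assumption: Hoeffding's inequality on each term, combined by a union
    bound (delta / 2 per term). Valid but conservative.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n, big_n = len(real_paired), len(sim_extra)
    estimate = ppi_mean_estimate(real_paired, sim_paired, sim_extra)
    log_term = math.log(4.0 / delta)
    half_sim = math.sqrt(log_term / (2.0 * big_n))     # sim outcomes span width 1
    half_rect = 2.0 * math.sqrt(log_term / (2.0 * n))  # differences span width 2
    half_width = half_sim + half_rect
    # Clip to [0, 1] because the target is a mean of scores in [0, 1].
    return max(0.0, estimate - half_width), min(1.0, estimate + half_width)


if __name__ == "__main__":
    real = [1, 0, 1, 1]             # hardware outcomes on 4 conditions
    sim = [1, 1, 1, 0]              # simulation outcomes on the same 4 conditions
    extra = [1, 1, 0, 1, 1, 0, 1, 1]  # simulation-only outcomes
    print(ppi_mean_estimate(real, sim, extra))
    print(hoeffding_ppi_interval(real, sim, extra, delta=0.1))
```

Hoeffding's bound depends only on the range of each term, not its variance, so this sketch cannot show the reported savings. A variance-adaptive bound would be needed to benefit from a small real-simulation gap.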