ICRA 2026

Beyond Domain Randomization: Safety Certificates for Reinforcement Learning

Paula Stocco, Francesco Micheli, Niklas Schmid, John Lygeros, Efe Balta

AI summary

Key figure (auto-extracted from paper)
P2L provides formal out-of-sample safety guarantees for RL policies while maintaining competitive performance, outperforming domain randomization in preventing real-world constraint violations.
Reinforcement Learning, Safety Certification, Sim-to-Real Transfer, Pick-to-Learn, Domain Randomization, Robotics

Problem

RL policies trained in simulation often fail or behave unsafely when deployed on real hardware due to sim-to-real gaps, and existing methods like domain randomization lack formal safety guarantees.

Approach

The authors wrap the Pick-to-Learn (P2L) meta-algorithm around standard RL training to compress the training dataset and compute probabilistic safety bounds on constraint satisfaction before deployment.
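
The selection loop described above can be sketched as follows. This is a minimal illustration of a P2L-style loop on a toy problem, not the authors' implementation: the names `pick_to_learn`, `learn`, `violation`, `learn_bound`, and `excess` are hypothetical, and the toy learner stands in for RL training, which this summary does not detail. The assumption is that the loop repeatedly adds the example on which the current hypothesis performs worst, retrains on the selected subset only, and stops once no remaining example violates the property of interest.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Z = TypeVar("Z")  # one training example (e.g. a simulated scenario)
H = TypeVar("H")  # a hypothesis (e.g. a trained policy)


def pick_to_learn(
    data: Sequence[Z],
    learn: Callable[[List[Z]], H],
    violation: Callable[[H, Z], float],
) -> Tuple[H, List[int]]:
    """P2L-style loop (illustrative sketch, hypothetical interface).

    learn(subset) trains a hypothesis on the selected examples only.
    violation(h, z) > 0 means h does not satisfy the property on z.
    Returns the final hypothesis and the indices of the selected examples.
    """
    chosen: List[int] = []
    hypothesis = learn([])  # initial hypothesis from the empty set
    while len(chosen) < len(data):
        remaining = [i for i in range(len(data)) if i not in chosen]
        # Pick the example that the current hypothesis handles worst.
        worst = max(remaining, key=lambda i: violation(hypothesis, data[i]))
        if violation(hypothesis, data[worst]) <= 0.0:
            break  # every unselected example already satisfies the property
        chosen.append(worst)
        hypothesis = learn([data[i] for i in chosen])
    return hypothesis, chosen


# Toy stand-in for training: learn an upper bound that covers all samples.
def learn_bound(samples: List[float]) -> float:
    return max(samples, default=0.0)


def excess(bound: float, sample: float) -> float:
    return sample - bound  # positive when the sample exceeds the bound


bound, picked = pick_to_learn([0.3, 1.2, 0.7, 0.9], learn_bound, excess)
```

In this toy run only one of the four samples is selected, so the compressed set is much smaller than the dataset. The size of that compressed set is the quantity a compression-based bound depends on.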

Key results

  • P2L maintains competitive swingup reward on a real Quanser cartpole while preventing constraint violations.
  • Domain randomization policies exhibited risky behavior and violated safety constraints in hardware trials.
  • P2L achieves a higher percentage of safe runs (up to 100%) than domain randomization in high-dimensional Unitree Go1 simulations.
  • P2L provides computable out-of-sample probabilistic risk bounds that depend on the size of the compressed training set.
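
To make the last point concrete, here is a hedged sketch of how a risk bound can be computed from the compressed set size alone. It uses the classical sample-compression bound (Littlestone-Warmuth style) with a union bound over set sizes. This is a stand-in: the bound used in the paper is not reproduced in this summary and may be tighter. The function name `compression_risk_bound` is hypothetical. The sketch assumes N i.i.d. examples, a learner whose output depends only on the set of k selected examples, and a final hypothesis that satisfies the property on all N - k unselected examples. Under those assumptions, with probability at least 1 - delta the violation probability on a new example is at most 1 - (delta / (N * C(N, k)))^(1 / (N - k)).

```python
import math


def compression_risk_bound(n: int, k: int, delta: float) -> float:
    """Classical compression bound on out-of-sample violation probability.

    n: number of i.i.d. training examples
    k: size of the compressed set (0 <= k < n)
    delta: allowed failure probability of the certificate (0 < delta < 1)
    """
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    if not 0.0 < delta < 1.0:
        raise ValueError("need 0 < delta < 1")
    # Work in logs so large binomial coefficients do not overflow floats.
    log_term = math.log(delta) - math.log(n) - math.log(math.comb(n, k))
    return 1.0 - math.exp(log_term / (n - k))


# The bound loosens as the compressed set grows, for fixed n and delta.
small_set = compression_risk_bound(1000, 5, 0.01)
large_set = compression_risk_bound(1000, 50, 0.01)
```

The design point this illustrates is that the certificate needs no held-out test data: it is a function of `n`, `k`, and `delta` only, so a smaller compressed set yields a tighter guarantee.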

Why it matters

Enables safe deployment of RL controllers in robotics by offering practical, data-driven safety certification prior to hardware testing.

Abstract

With the growing acceptance of robotics in daily life, there is a growing need for certifiably safe control policies. While simulation provides a safe training environment, policies often fail in sim-to-real transfer. We propose a data-driven certification framework for reinforcement learning based on Pick-to-Learn (P2L), a meta-algorithm that uses data preference ordering to compute probabilistic bounds on the satisfaction of application-dependent properties of interest. Our results demonstrate that using P2L maintains high performance while distinguishing between policies that appear similar under domain randomization alone. This work offers a practical method for preparing safe reinforcement learning policies by providing formal safety guarantees prior to hardware deployment.

Index terms

Robot Safety, Planning under Uncertainty, Reinforcement Learning
