Beyond Domain Randomization: Safety Certificates for Reinforcement Learning
Paula Stocco, Francesco Micheli, Niklas Schmid, John Lygeros, Efe Balta
AI summary
Problem
RL policies trained in simulation often fail or behave unsafely when deployed on real hardware due to sim-to-real gaps, and existing methods like domain randomization lack formal safety guarantees.
Approach
The authors wrap the Pick-to-Learn (P2L) meta-algorithm around standard RL training to compress the training dataset and compute probabilistic safety bounds on constraint satisfaction before deployment.
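The wrapping idea can be illustrated with a minimal, generic sketch of a Pick-to-Learn style loop. This is not the authors' implementation. The names `pick_to_learn`, `train`, `score`, and `threshold` are hypothetical, and the toy "learner" at the bottom stands in for a full RL training run. The loop retrains on a growing compressed set, each time adding the sample the current hypothesis handles worst, and stops once every remaining sample is handled acceptably.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Z = TypeVar("Z")  # one training sample (e.g. a scenario or environment draw)
H = TypeVar("H")  # a hypothesis (e.g. a trained policy)


def pick_to_learn(
    data: Sequence[Z],
    train: Callable[[List[Z]], H],
    score: Callable[[H, Z], float],
    threshold: float = 0.0,
) -> Tuple[H, List[int]]:
    """Generic P2L-style loop (illustrative sketch).

    train(subset)  -> hypothesis fitted on the compressed subset only
    score(h, z)    -> how badly h handles z (larger is worse)
    threshold      -> scores at or below this count as acceptable
    Returns the final hypothesis and the indices of the compressed set.
    """
    compressed: List[int] = []
    remaining = list(range(len(data)))
    hypothesis = train([])  # initial hypothesis from an empty set

    while remaining:
        # Pick the sample that the current hypothesis handles worst.
        worst = max(remaining, key=lambda i: score(hypothesis, data[i]))
        if score(hypothesis, data[worst]) <= threshold:
            break  # every remaining sample is handled acceptably
        compressed.append(worst)
        remaining.remove(worst)
        hypothesis = train([data[i] for i in compressed])

    return hypothesis, compressed


# Toy stand-in for RL training: the "policy" is a safety margin that must
# cover every disturbance magnitude seen in the data.
disturbances = [0.2, 0.9, 0.5]
margin, picked = pick_to_learn(
    disturbances,
    train=lambda subset: max(subset, default=0.0),
    score=lambda m, d: d - m,  # positive means the margin is exceeded
)
print(margin, picked)
```

In this toy run a single sample (the largest disturbance) is enough to handle all the others, so the compressed set has size one. The size of that set is the quantity the risk bound depends on.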
Key results
- P2L maintains competitive swing-up reward on a real Quanser cartpole while preventing constraint violations.
- Domain randomization policies exhibited risky behavior and violated safety constraints in hardware trials.
- P2L achieves higher percentages of safe runs (up to 100%) on high-dimensional Unitree Go1 simulations.
- The method provides computable out-of-sample probabilistic risk bounds that depend on the size of the compressed training set.
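To give a feel for how a bound can depend on the compressed set size, the sketch below evaluates a classical sample-compression bound. It is a stand-in, not the P2L bound used in the paper, which is tighter and is not reproduced here. The function name `compression_risk_bound` is hypothetical. The assumptions are: N i.i.d. samples, a final hypothesis that depends only on an (unordered) compressed set of size k, and a hypothesis that handles all remaining N - k samples acceptably. Under these assumptions, with confidence at least 1 - δ, the probability of mishandling a new sample is at most ε = 1 - (δ / (N · C(N, k)))^(1/(N-k)).

```python
import math


def compression_risk_bound(n: int, k: int, delta: float) -> float:
    """Classical sample-compression risk bound (illustrative, not the P2L bound).

    n     -- number of i.i.d. training samples
    k     -- size of the compressed set (0 <= k <= n)
    delta -- allowed failure probability of the certificate (0 < delta < 1)
    """
    if n < 1 or not 0 <= k <= n or not 0.0 < delta < 1.0:
        raise ValueError("require n >= 1, 0 <= k <= n and 0 < delta < 1")
    if k == n:
        return 1.0  # no held-out samples remain, so the bound is trivial

    # log of the binomial coefficient C(n, k), computed stably via lgamma
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # union bound over all subsets of size k and over all n possible sizes
    log_term = math.log(delta) - math.log(n) - log_binom
    return 1.0 - math.exp(log_term / (n - k))


for k in (0, 10, 50):
    print(k, round(compression_risk_bound(1000, k, 0.05), 4))
```

A smaller compressed set yields a smaller risk bound for the same N and δ, which is why compressing the training data matters for certification.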
Why it matters
Enables safe deployment of RL controllers in robotics by offering practical, data-driven safety certification prior to hardware testing.
Abstract
With the growing acceptance of robotics in daily life, there is an increasing need for certifiably safe control policies. While simulation provides a safe training environment, policies often fail in sim-to-real transfer. We propose a data-driven certification framework for reinforcement learning based on Pick-to-Learn (P2L), a meta-algorithm that uses data preference ordering to compute probabilistic bounds on the satisfaction of application-dependent properties of interest. Our results demonstrate that using P2L maintains high performance while distinguishing between policies that appear similar under domain randomization alone. This work offers a practical method for preparing safe reinforcement learning policies by providing formal safety guarantees prior to hardware deployment.