Failure-Aware RL: Reliable Offline-To-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation
AI summary
Problem
Offline-to-online reinforcement learning for robotics frequently causes irreversible Intervention-requiring Failures during exploration, blocking safe real-world deployment.
Approach
FARL trains a latent world model offline to predict near-future failures, then uses a fixed recovery policy to override risky actions during online policy fine-tuning.
Key results
- Introduced FailureBench benchmark for failure-aware RL evaluation
- Reduced real-world intervention-requiring failures by 73.1%
- Improved average task performance by 11.3% during online fine-tuning
- Achieved up to 65.8% failure reduction in challenging simulated environments
Why it matters
Enables safer, more efficient real-world robotic policy refinement by minimizing costly human interventions during reinforcement learning post-training.
Abstract
Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a framework for minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training.
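The interplay between the safety critic and the recovery policy described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the scalar risk score, and the `risk_threshold` value are all assumptions made for illustration, and the hand-written `toy_failure_risk` merely stands in for the learned world-model-based safety critic.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


def select_action(
    state: State,
    task_policy: Callable[[State], Action],
    recovery_policy: Callable[[State], Action],
    failure_risk: Callable[[State, Action], float],
    risk_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> Tuple[Action, bool]:
    """Return the action to execute and whether recovery overrode the task policy."""
    proposed = task_policy(state)
    # If the critic scores the proposed action as too risky, hand control
    # to the recovery policy instead of executing the proposed action.
    if failure_risk(state, proposed) > risk_threshold:
        return recovery_policy(state), True
    return proposed, False


# Toy 1-D stand-ins (all hypothetical): the "failure" region is position > 3.0.
def toy_task_policy(state: State) -> Action:
    return [1.0]  # always push forward


def toy_recovery_policy(state: State) -> Action:
    return [-1.0]  # always back away


def toy_failure_risk(state: State, action: Action) -> float:
    # Stand-in for a learned critic: risk is 1.0 if the next position
    # would enter the failure region, otherwise 0.0.
    return 1.0 if state[0] + action[0] > 3.0 else 0.0


if __name__ == "__main__":
    for position in (0.0, 2.5):
        action, overridden = select_action(
            [position], toy_task_policy, toy_recovery_policy, toy_failure_risk
        )
        print(position, action, overridden)
```

In this toy setup, the proposed action is executed unchanged at position 0.0, while at position 2.5 the recovery action replaces it. How the actual risk score is computed, and how overridden transitions feed back into online fine-tuning, are not specified in the abstract and are left out of this sketch.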