Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, Stefanie Tellex
AI summary
Problem
Existing residual reinforcement learning methods suffer from unconstrained exploration and are limited to deterministic base policies, making them inefficient and unsuitable for modern stochastic imitation learners.
Approach
The method uses the base policy's uncertainty estimates to restrict exploration to uncertain states and modifies the off-policy critic to learn Q-values for the combined base and residual actions, enabling stable training with stochastic policies.
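The gating idea in the approach can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `combine_action`, the hard threshold gate, the residual `scale`, and the action bounds of [-1, 1] are all assumptions made for illustration, and the summary does not say how uncertainty is computed or thresholded.

```python
import numpy as np

def combine_action(base_action, residual_action, uncertainty,
                   threshold=0.5, scale=0.1):
    """Add a scaled residual only when the base policy is uncertain.

    `uncertainty` is a scalar estimate supplied by the base policy
    (hypothetical here; e.g. a predicted standard deviation).
    `threshold` and `scale` are illustrative hyperparameters.
    """
    base_action = np.asarray(base_action, dtype=float)
    residual_action = np.asarray(residual_action, dtype=float)
    # Gate is 1 in uncertain states and 0 where the base policy is confident.
    gate = 1.0 if uncertainty > threshold else 0.0
    combined = base_action + gate * scale * residual_action
    # Assumed action bounds of [-1, 1]; adjust for the actual environment.
    return np.clip(combined, -1.0, 1.0), gate
```

In confident states the combined action equals the base action, so exploration by the residual policy is confined to states the base policy flags as uncertain.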
Key results
- Uncertainty-guided exploration focuses residual learning on high-uncertainty states
- Asymmetric actor-critic formulation enables off-policy residual RL with stochastic base policies
- Outperforms state-of-the-art finetuning, demo-augmented, and residual RL baselines across simulation benchmarks
- Demonstrates successful zero-shot sim-to-real transfer on a physical robot
Why it matters
Enables more sample-efficient and robust adaptation of modern stochastic robot policies, accelerating practical deployment in real-world environments.
Abstract
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other Residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.
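One possible reading of the second improvement is sketched below: the residual policy receives the sampled base action as an input, and the critic's bootstrap target is evaluated on the combined action instead of the residual alone. The function `td_target`, the callables `base_policy`, `residual_policy`, and `target_q`, and the additive combination are hypothetical names and assumptions, not details confirmed by the abstract.

```python
import numpy as np

def td_target(reward, done, next_state, base_policy, residual_policy,
              target_q, gamma=0.99):
    """One-step TD target with the critic evaluated on the combined action.

    All callables are placeholders supplied by the caller:
      base_policy(state) -> sampled base action (may be stochastic)
      residual_policy(state, base_action) -> corrective action
      target_q(state, action) -> scalar Q-value estimate
    """
    # Sample the base action once so the residual and critic see the same draw.
    next_base = np.asarray(base_policy(next_state), dtype=float)
    # The residual policy observes the sampled base action with the state.
    next_residual = np.asarray(residual_policy(next_state, next_base), dtype=float)
    # Assumed additive combination of base and residual actions.
    next_combined = next_base + next_residual
    # Standard bootstrapped target; terminal transitions drop the bootstrap term.
    return reward + gamma * (1.0 - done) * target_q(next_state, next_combined)
```

Because the base action is sampled once and reused, the Q-value refers to the action that would actually be executed, which is the property a stochastic base policy otherwise breaks.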