Residual Off-Policy RL for Finetuning Behavior Cloning Policies
Lars Ankile, Zhenyu Jiang, Yan Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
AI summary
Problem
Direct reinforcement learning on real-world, high-degree-of-freedom robots suffers from sample inefficiency and safety risks, while behavior cloning alone plateaus due to its reliance on limited human demonstrations.
Approach
ResFiT freezes a pre-trained behavior cloning policy as a fixed base and uses sample-efficient off-policy RL to learn per-step residual corrections, enabling safe exploration and closed-loop refinement without retraining the large base model.
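To make the residual idea concrete, the following is a minimal sketch of how a per-step correction could be combined with a frozen base action. It is not the authors' implementation: the names `combine_actions`, `base_policy`, and `residual_policy`, the `residual_scale` of 0.1, and the action bounds of [-1, 1] are all assumptions chosen for illustration, and both policies are simple stand-ins rather than trained networks.

```python
import numpy as np


def combine_actions(base_action, residual, residual_scale=0.1, low=-1.0, high=1.0):
    """Add a bounded residual correction to a frozen base policy's action.

    residual_scale and the [low, high] bounds are illustrative assumptions.
    """
    base_action = np.asarray(base_action, dtype=np.float64)
    residual = np.asarray(residual, dtype=np.float64)
    # Keep the raw residual in [-1, 1] so the scaled correction stays small.
    bounded = np.clip(residual, -1.0, 1.0)
    corrected = base_action + residual_scale * bounded
    # Respect the (assumed) action limits of the robot.
    return np.clip(corrected, low, high)


def base_policy(obs):
    """Hypothetical stand-in for the frozen behavior cloning policy."""
    return np.tanh(obs[:3])


def residual_policy(obs, base_action):
    """Hypothetical stand-in for the learned residual.

    It sees both the observation and the base action; here it returns zeros,
    which corresponds to an untrained residual that leaves the base unchanged.
    """
    return np.zeros_like(base_action)


obs = np.array([0.2, -0.5, 1.0, 0.3])
a_base = base_policy(obs)
a = combine_actions(a_base, residual_policy(obs, a_base))
```

With a zero residual the executed action equals the base action, so under these assumptions exploration starts from the behavior of the cloned policy and only moves away from it by a bounded amount per step.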
Key results
- State-of-the-art simulation performance on vision-based sparse-reward tasks
- Sample-efficient real-world RL training on a 29-DoF bimanual humanoid
- First successful real-world RL on a humanoid with dexterous five-fingered hands
- Extensive ablations validating high update-to-data ratios, n-step returns, and layer normalization
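The n-step returns mentioned in the ablations can be illustrated with a short sketch of the standard n-step bootstrapped target under a sparse binary reward. This is the textbook formulation, not code from the paper: the function name `n_step_target`, the discount of 0.9, and the example reward sequences are assumptions made for illustration.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.9, done=False):
    """Standard n-step bootstrapped target for a critic.

    rewards: the n rewards observed after the current state.
    bootstrap_value: critic estimate at the state reached after n steps.
    done: True if the episode ended within these n steps (no bootstrap).
    """
    target = 0.0
    for k, r in enumerate(rewards):
        # Discount each reward by how many steps in the future it arrives.
        target += (gamma ** k) * r
    if not done:
        # Bootstrap from the critic only if the episode is still running.
        target += (gamma ** len(rewards)) * bootstrap_value
    return target


# Sparse binary reward: success arrives on the third step and ends the episode.
success_target = n_step_target([0.0, 0.0, 1.0], bootstrap_value=0.0, done=True)

# No reward within the window: the target relies entirely on the bootstrap.
pending_target = n_step_target([0.0, 0.0, 0.0], bootstrap_value=0.5)
```

In the first case the success signal reaches the current state in a single update, discounted by two steps, which is the usual argument for n-step returns when rewards are sparse and horizons are long.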
Why it matters
Provides a practical, scalable pathway for deploying reinforcement learning on complex real-world robots without relying on dense rewards or extensive simulation-to-real transfer.
Abstract
Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency and safety concerns. These challenges are compounded for high-degree-of-freedom (DoF) systems that must learn from sparse rewards over long horizons. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: residual-offpolicy-rl.github.io.