ICRA 2026

Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Lars Ankile, Zhenyu Jiang, Yan Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi

AI summary

Freezing behavior cloning policies and learning lightweight off-policy residuals enables sample-efficient, safe real-world reinforcement learning on high-DoF humanoid robots using only sparse rewards.

Residual RL, Behavior Cloning, Real-world RL, Off-policy learning, Dexterous manipulation, Sample efficiency

Problem

Direct reinforcement learning on real-world, high-degree-of-freedom robots suffers from sample inefficiency and safety risks, while behavior cloning alone plateaus due to its reliance on limited human demonstrations.

Approach

ResFiT freezes a pre-trained behavior cloning policy as a fixed base and uses sample-efficient off-policy RL to learn per-step residual corrections, enabling safe exploration and closed-loop refinement without retraining the large base model.
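The composition of a frozen base and a learned residual can be illustrated with a minimal sketch. This is not the authors' implementation: the stand-in `base_policy`, the linear `residual_policy`, the `residual_scale` of 0.1, and the [-1, 1] action bounds are all hypothetical choices made only to show the per-step structure, in which the base is queried as a black box and only the residual parameters would be trained.

```python
import numpy as np

def base_policy(obs):
    """Hypothetical stand-in for a frozen BC policy (black box, never updated)."""
    return np.tanh(obs[:3])

def residual_policy(obs, base_action, weights):
    """Hypothetical lightweight residual head.

    It conditions on both the observation and the base action, and its
    output is squashed to [-1, 1] so the correction stays bounded.
    """
    features = np.concatenate([obs, base_action])
    return np.tanh(weights @ features)

def combined_action(obs, weights, residual_scale=0.1, low=-1.0, high=1.0):
    """Per-step action = base action + small scaled correction, clipped to bounds."""
    a_base = base_policy(obs)
    delta = residual_scale * residual_policy(obs, a_base, weights)
    return np.clip(a_base + delta, low, high)

obs = np.array([0.2, -0.5, 1.0, 0.3, -0.1])
weights = np.zeros((3, 8))  # an all-zero residual leaves the base action unchanged
print(combined_action(obs, weights))
```

Under these assumptions, bounding the correction means the executed action never moves further than `residual_scale` from the base action, which is one way to picture why exploring around a competent base can be safer than exploring from scratch.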

Key results

State-of-the-art simulation performance on vision-based sparse-reward tasks
Sample-efficient real-world RL training on a 29-DoF bimanual humanoid
First successful real-world RL training, to the best of the authors' knowledge, on a humanoid with dexterous five-fingered hands
Extensive ablations validating high update-to-data ratios, n-step returns, and layer normalization
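To make the n-step return ablation concrete, here is a generic sketch of how an n-step bootstrapped target is commonly computed in off-policy RL. It is a textbook formulation, not code from the paper: the function name `n_step_target`, the discount `gamma`, and the example numbers are assumptions chosen for illustration. With a sparse binary reward, the n-step sum lets a success signal reach earlier transitions in one update instead of propagating one step at a time.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Generic n-step TD target (illustrative, not the paper's code).

    rewards[k] and dones[k] describe step t+k for k = 0..n-1, and
    bootstrap_value is the critic's estimate at step t+n. A done flag
    cuts off everything after that step, including the bootstrap.
    """
    target = bootstrap_value
    # Work backwards: G_k = r_k + gamma * (1 - done_k) * G_{k+1}
    for reward, done in zip(reversed(rewards), reversed(dones)):
        target = reward + gamma * (1.0 - done) * target
    return target

# Sparse binary reward: success (reward 1) arrives on the third step and ends the episode.
print(n_step_target([0.0, 0.0, 1.0], [0.0, 0.0, 1.0], bootstrap_value=5.0, gamma=0.9))
```

In this example the bootstrap value is ignored because the episode terminates inside the window, so the target is just the discounted success reward.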

Why it matters

Provides a practical, scalable pathway for deploying reinforcement learning on complex real-world robots without relying on dense rewards or extensive simulation-to-real transfer.

Abstract

Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency and safety concerns. These challenges are compounded for high-degree-of-freedom (DoF) systems that must learn from sparse rewards over long horizons. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: residual-offpolicy-rl.github.io.

Index terms

Bimanual Manipulation, Reinforcement Learning, Imitation Learning