ICRA 2026

Efficiently Learning Robust Torque-Based Locomotion through Reinforcement with Model-Based Supervision

Yashuai Yan, Tobias Egle, Christian Ott, Dongheui Lee

AI summary

Supervising a residual reinforcement learning policy with a privileged model-based oracle enables robust, torque-controlled bipedal walking under severe real-world uncertainties.

bipedal locomotion, residual reinforcement learning, model-based supervision, sim-to-real transfer, torque control, domain randomization

Problem

Pure reinforcement learning for torque-based bipedal locomotion is sample-inefficient and requires heavy reward engineering, while traditional model-based controllers degrade under unmodeled dynamics and sensor noise.

Approach

The authors combine a reliable model-based base controller with a residual RL policy. The residual policy is trained under domain randomization with a combined supervised and RL loss, where the supervised term comes from a privileged oracle policy that has access to ground-truth dynamics during training.
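
As a rough illustration of the approach, the sketch below shows one way a residual correction and a combined objective could fit together. It is not the authors' implementation: the function names, the assumption that the residual is added directly to the base controller's joint torques, the mean-squared-error form of the supervised term, and the `weight` parameter are all hypothetical choices made for this example.

```python
from typing import Sequence


def compose_torque(base_torque: Sequence[float],
                   residual_torque: Sequence[float]) -> list[float]:
    """Add the learned residual to the base controller's output.

    Assumption: the residual acts additively on joint torques.
    """
    return [b + r for b, r in zip(base_torque, residual_torque)]


def supervised_loss(final_torque: Sequence[float],
                    oracle_torque: Sequence[float]) -> float:
    """Mean squared error between the corrected command and the oracle's.

    Assumption: MSE is used here only as a simple stand-in.
    """
    squared = [(f - o) ** 2 for f, o in zip(final_torque, oracle_torque)]
    return sum(squared) / len(squared)


def combined_loss(rl_loss: float,
                  base_torque: Sequence[float],
                  residual_torque: Sequence[float],
                  oracle_torque: Sequence[float],
                  weight: float = 1.0) -> float:
    """Hypothetical combined objective: RL loss plus weighted supervision."""
    final_torque = compose_torque(base_torque, residual_torque)
    return rl_loss + weight * supervised_loss(final_torque, oracle_torque)
```

For example, with a base torque of `[1.0, 2.0]`, a residual of `[0.5, -0.5]`, and an oracle torque of `[2.0, 1.0]`, the corrected command is `[1.5, 1.5]` and the supervised term is 0.25. The oracle is only needed to compute this term during training; at deployment only the base controller and the residual policy would run.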

Key results

Up to 100% walking success rates across three bipedal platforms under heavy domain randomization
Substantially lower DCM and foot tracking errors versus model-based and pure RL baselines
Robust performance maintained across increasing uncertainty levels without retraining
Faster policy convergence via oracle supervision, reducing the need for manual reward engineering

Why it matters

Offers a practical, scalable pathway for deploying robust torque-based locomotion controllers on real-world bipedal robots by bridging model-based reliability with data-driven adaptability.

Abstract

We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller, comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.
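
The split between nominal and privileged information under domain randomization can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the choice of randomized quantities (mass scale, friction, sensor noise), their ranges, the `level` knob for uncertainty strength, and all names are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Dynamics:
    mass_scale: float        # multiplier on nominal link masses
    friction: float          # multiplier on nominal ground friction
    sensor_noise_std: float  # std. dev. of additive sensor noise


# The model the base controller believes in (never randomized).
NOMINAL = Dynamics(mass_scale=1.0, friction=1.0, sensor_noise_std=0.0)


def sample_dynamics(rng: random.Random, level: float) -> Dynamics:
    """Draw perturbed dynamics; `level` in [0, 1] scales the ranges.

    The ranges below are placeholder values, not the paper's.
    """
    return Dynamics(
        mass_scale=1.0 + rng.uniform(-0.3, 0.3) * level,
        friction=1.0 + rng.uniform(-0.5, 0.5) * level,
        sensor_noise_std=0.05 * level,
    )


def episode_models(rng: random.Random, level: float) -> dict[str, Dynamics]:
    """Return the model each controller sees in one training episode.

    The base controller keeps the nominal model, while the oracle gets
    privileged access to the dynamics actually sampled for the simulator.
    """
    true_dynamics = sample_dynamics(rng, level)
    return {"base": NOMINAL, "oracle": true_dynamics}
```

With `level=0.0` the oracle and the base controller see the same model, so there is nothing for the residual to correct; as `level` grows, the gap between the two is what the supervised signal would encode.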

Index terms

Humanoid and Bipedal Locomotion, Reinforcement Learning