ICRA 2026

CLF-RL: Control Lyapunov Function Guided Reinforcement Learning

Kejun Li, Zachary Olkin, Yisong Yue, Aaron Ames

AI summary

Embedding Control Lyapunov Functions into reinforcement learning rewards significantly improves the robustness and tracking performance of bipedal locomotion policies during real-world deployment.

Bipedal locomotion, Reinforcement learning, Control Lyapunov functions, Reward shaping, Humanoid robots, Sim-to-real transfer

Problem

Reinforcement learning for bipedal robots struggles with tedious, heuristic reward design and poor sim-to-real transfer, while traditional model-based controllers are computationally heavy and sensitive to model mismatch.

Approach

The method guides RL training by embedding model-based reference trajectories and a Control Lyapunov Function decrease condition directly into the reward. This gives the policy a structured, stability-motivated learning signal without constraining its action space, and both the references and the CLF shaping are needed only during training.
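As a rough illustration of how a CLF decrease condition can enter a reward, here is a minimal Python sketch. It is not the paper's reward. It assumes a quadratic CLF candidate V(e) = eᵀPe on the tracking error, a finite-difference estimate of its time derivative, and a hinge penalty on violations of V̇ + λV ≤ 0. The function names, the weights `w_value` and `w_decrease`, and the rate `decay_rate` are hypothetical.

```python
import numpy as np


def clf_value(err, P):
    """Quadratic CLF candidate V(e) = e^T P e (assumed form, P positive definite)."""
    return float(err @ P @ err)


def clf_reward(err, err_next, P, dt, decay_rate=1.0, w_value=1.0, w_decrease=1.0):
    """Illustrative CLF-shaped reward for one environment step.

    err, err_next: tracking errors (state minus reference) at consecutive steps.
    The first term penalizes tracking error through V; the second penalizes
    violation of the decrease condition V_dot + decay_rate * V <= 0.
    """
    v = clf_value(err, P)
    v_next = clf_value(err_next, P)
    v_dot = (v_next - v) / dt  # finite-difference estimate of dV/dt
    violation = max(0.0, v_dot + decay_rate * v)  # zero when V decays fast enough
    return -w_value * v - w_decrease * violation


if __name__ == "__main__":
    P = np.eye(2)
    e0 = np.array([1.0, 0.0])
    print(clf_reward(e0, np.array([0.5, 0.0]), P, dt=0.1))  # error shrinking
    print(clf_reward(e0, e0, P, dt=0.1))                    # error not shrinking
```

Because the shaping term depends only on the tracking error against a reference, it can be computed entirely inside the training environment, which is consistent with the summary's point that the deployed policy stays lightweight.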

Key results

Significantly improved robustness and tracking performance over baseline RL policies
Reduced variance under randomized model perturbations in simulation
Successful real-world deployment on a Unitree G1 humanoid robot
Lightweight inference policy with minimal reward tuning required

Why it matters

It offers a principled, computationally efficient bridge between control theory and reinforcement learning, enabling more reliable and robust legged locomotion for real-world robotics applications.

Abstract

Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
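For readers unfamiliar with the LIP model mentioned in the abstract, the following Python sketch evaluates the standard closed-form solution of the linear inverted pendulum dynamics, ẍ = (g / z₀) x, where x is the center-of-mass position relative to the stance foot and z₀ is a constant center-of-mass height. It shows only the textbook reduced-order dynamics, not the paper's velocity-conditioned planner. The function name and the numeric values in the usage lines are hypothetical.

```python
import math

GRAVITY = 9.81  # m/s^2


def lip_state(x0, v0, com_height, t):
    """Closed-form LIP solution for x_ddot = (g / z0) * x.

    x0, v0: initial center-of-mass position (relative to the stance foot)
            and velocity; com_height: constant CoM height z0; t: elapsed time.
    Returns the position and velocity at time t.
    """
    omega = math.sqrt(GRAVITY / com_height)  # natural frequency of the pendulum
    c, s = math.cosh(omega * t), math.sinh(omega * t)
    x = x0 * c + (v0 / omega) * s
    v = x0 * omega * s + v0 * c
    return x, v


if __name__ == "__main__":
    # Hypothetical values: CoM starts 5 cm behind the foot, moving forward.
    for t in (0.0, 0.1, 0.2, 0.3):
        x, v = lip_state(-0.05, 0.4, 0.7, t)
        print(f"t={t:.1f}s  x={x:+.4f} m  v={v:+.4f} m/s")
```

A planner built on this model would typically sample such trajectories over a step duration and choose foot placements to reach a commanded velocity; those steps are omitted here.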

Index terms

Humanoid and Bipedal Locomotion, Reinforcement Learning