Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion
Yuanhong Zeng, Anushri Dixit
AI summary
Problem
Fixed risk levels in reinforcement learning are difficult to calibrate for unknown real-world conditions and often produce overly conservative or unstable quadrupedal locomotion. Existing risk-aware methods also suffer from high variance and poor training stability.
Approach
The authors train a family of risk-conditioned policies using a stable, bootstrapped CVaR-constrained PPO algorithm, then deploy an online Upper Confidence Bound multi-armed bandit to dynamically select the optimal policy based solely on observed episodic returns.
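The deployment-time selection step can be illustrated with a standard UCB1 bandit in which each arm is one pre-trained risk-conditioned policy and the only feedback is the episodic return. This is a minimal sketch, not the authors' implementation: the class name `UCBPolicySelector`, the exploration constant, and the made-up mean returns are all assumptions, and returns are assumed to be normalized to roughly [0, 1].

```python
import math


class UCBPolicySelector:
    """UCB1 over a fixed set of pre-trained policies (one arm per risk level).

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self, n_policies, exploration=2.0):
        self.counts = [0] * n_policies   # episodes run with each policy
        self.means = [0.0] * n_policies  # running mean return per policy
        self.exploration = exploration   # assumed exploration constant

    def select(self):
        # Try every policy once before trusting any estimate.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Score = mean return + confidence bonus that shrinks with more trials.
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, episodic_return):
        # Incremental mean update using only the observed episodic return.
        self.counts[arm] += 1
        self.means[arm] += (episodic_return - self.means[arm]) / self.counts[arm]


# Hypothetical usage with made-up, noise-free returns per policy.
selector = UCBPolicySelector(n_policies=3)
true_mean_return = [0.2, 0.5, 0.8]  # illustrative values only
for _ in range(200):
    arm = selector.select()
    selector.update(arm, true_mean_return[arm])  # stand-in for one episode
```

In this toy run the selector spends most episodes on the policy with the highest return while still occasionally re-checking the others, which is the behavior the approach relies on when conditions are unknown in advance.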
Key results
- Stable CVaR-constrained PPO with clipped Lagrangian and bootstrapped returns
- Online UCB bandit selector adapting risk levels from episodic returns alone
- Nearly twice the mean and tail performance of PPO baselines in perturbed simulations
- Successful real-world deployment on a Unitree Go2 with two-minute bandit convergence
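The "clipped Lagrangian" in the first result is not specified further in this summary. One common way to stabilize a Lagrangian-constrained policy update is projected dual ascent with a bounded multiplier, sketched below. The function name `update_multiplier`, the step size, the upper bound `lam_max`, and the assumption that the clipping applies to the multiplier itself are all hypothetical and may differ from the authors' algorithm.

```python
def update_multiplier(lam, cvar_estimate, cvar_threshold, lr=0.05, lam_max=10.0):
    """One dual-ascent step for the constraint cvar_estimate >= cvar_threshold.

    Hypothetical sketch: the multiplier grows while the tail return falls
    short of the threshold, shrinks once the constraint is met, and is
    clipped to [0, lam_max] so it can neither go negative nor blow up.
    """
    violation = cvar_threshold - cvar_estimate  # > 0 means constraint violated
    return min(max(lam + lr * violation, 0.0), lam_max)
```

Bounding the multiplier is one way to keep the penalty term from dominating the policy loss when the tail estimate is noisy early in training.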
Why it matters
Provides a practical, data-driven framework for deploying robust, adaptive locomotion policies on real robots without privileged environment information.
Abstract
In this work, we introduce a risk-aware reinforcement learning framework for robust quadrupedal locomotion. Our approach first trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained optimization technique, which improves both training stability and sample efficiency. During deployment, we frame online policy selection as a multi-armed bandit problem. Relying solely on observed episodic returns rather than privileged environment information, this method dynamically adjusts the robot’s robustness level to handle unknown conditions on the fly. We evaluate our approach in simulation across eight diverse settings (varying dynamics, contacts, sensing noise, and terrain) as well as in real-world trials on a Unitree Go2 robot. Compared to existing baselines, our risk-aware policy achieves nearly twice the mean and tail performance in novel environments, with the bandit algorithm successfully identifying the optimal policy within just two minutes of operation.
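For readers unfamiliar with CVaR, the quantity behind "tail performance" can be estimated from a batch of episodic returns as the mean of the worst alpha-fraction of them. The sketch below uses this standard empirical estimator for returns (where lower is worse); the function name `empirical_cvar` and the sample values are illustrative assumptions, and the paper's bootstrapped estimator may differ.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns (lower tail).

    alpha close to 0 focuses on the very worst episodes; alpha = 1
    recovers the ordinary mean return.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)                       # worst returns first
    k = max(1, math.ceil(alpha * len(ordered)))     # size of the tail
    tail = ordered[:k]
    return sum(tail) / len(tail)


# Hypothetical episodic returns for illustration.
sample_returns = [4.0, 1.0, 3.0, 2.0]
tail_value = empirical_cvar(sample_returns, alpha=0.5)  # mean of the two worst
mean_value = empirical_cvar(sample_returns, alpha=1.0)  # plain mean
```

A policy constrained on this quantity is pushed to improve its worst episodes, not only its average, which is why different values of alpha yield policies with different levels of robustness.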