ICRA 2026

Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion

Yuanhong Zeng, Anushri Dixit

AI summary

Adaptively selecting among risk-conditioned policies with an online bandit algorithm nearly doubles mean and tail locomotion performance in novel environments, and the bandit identifies the best policy within two minutes of operation.

Risk-aware reinforcement learning; Conditional Value-at-Risk; Multi-armed bandit; Quadrupedal locomotion; Sim-to-real transfer; Online adaptation

Problem

Fixed risk levels in reinforcement learning are difficult to calibrate for unknown real-world conditions, often causing overly conservative or unstable quadrupedal locomotion, while existing risk-aware methods suffer from high variance and poor training stability.

Approach

The authors train a family of risk-conditioned policies using a stable, bootstrapped CVaR-constrained PPO algorithm, then deploy an online Upper Confidence Bound multi-armed bandit to dynamically select the optimal policy based solely on observed episodic returns.
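The selection step described above can be illustrated with a minimal UCB1 sketch. This is not the authors' implementation: the class name `UCB1Selector`, the exploration constant, the three hypothetical risk levels, and the simulated returns (clamped to [0, 1]) are all assumptions made for illustration. The only property carried over from the summary is that the selector sees nothing but episodic returns.

```python
import math
import random


class UCB1Selector:
    """UCB1 over a fixed set of arms (here: hypothetical risk-conditioned policies)."""

    def __init__(self, n_arms, exploration=2.0):
        self.counts = [0] * n_arms      # episodes run with each arm
        self.means = [0.0] * n_arms     # running mean return per arm
        self.exploration = exploration  # assumed constant; 2.0 is the textbook UCB1 value

    def select(self):
        # Try every arm once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, episodic_return):
        # Incremental update of the running mean for the arm that was played.
        self.counts[arm] += 1
        self.means[arm] += (episodic_return - self.means[arm]) / self.counts[arm]


def run_demo(n_episodes=500, seed=0):
    """Simulate episodes against made-up per-arm mean returns (assumption)."""
    rng = random.Random(seed)
    true_means = [0.3, 0.7, 0.5]  # hypothetical normalized returns per risk level
    selector = UCB1Selector(n_arms=len(true_means))
    for _ in range(n_episodes):
        arm = selector.select()
        # Stand-in for running one episode on the robot with the chosen policy.
        ret = min(1.0, max(0.0, rng.gauss(true_means[arm], 0.1)))
        selector.update(arm, ret)
    return selector


if __name__ == "__main__":
    demo = run_demo()
    print("pulls per arm:", demo.counts)
```

In this toy setting the selector concentrates its pulls on the arm with the highest mean return while still sampling the others occasionally. On a real robot, returns would need to be scaled to a bounded range for the UCB1 confidence term to be meaningful.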

Key results

Stable CVaR-constrained PPO with clipped Lagrangian and bootstrapped returns
Online UCB bandit selector adapting risk levels from episodic returns alone
Nearly double mean and tail performance over PPO baselines in perturbed simulations
Successful real-world deployment on a Unitree Go2 with two-minute bandit convergence
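To make the first item in the list above more concrete, the sketch below shows an empirical lower-tail CVaR estimate and a clipped Lagrange multiplier update. The summary does not say how the Lagrangian is clipped or how returns are bootstrapped, so this assumes one common reading: projected dual ascent with the multiplier kept in `[0, lam_max]`. The function names, learning rate, threshold, and `lam_max` are hypothetical.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns (lower-tail CVaR of returns)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def clipped_dual_step(lam, cvar, threshold, lr=0.1, lam_max=10.0):
    """One projected dual-ascent step for the constraint cvar >= threshold.

    Assumption: "clipped" is read here as clamping the multiplier to
    [0, lam_max]; the paper may define it differently.
    """
    # The multiplier grows while the constraint is violated (cvar < threshold)
    # and shrinks once the tail performance exceeds the threshold.
    lam = lam + lr * (threshold - cvar)
    return min(lam_max, max(0.0, lam))


if __name__ == "__main__":
    episode_returns = [float(r) for r in range(1, 11)]  # toy data: 1.0 .. 10.0
    tail = empirical_cvar(episode_returns, alpha=0.2)
    print("CVaR at alpha=0.2:", tail)
    print("updated multiplier:", clipped_dual_step(0.0, tail, threshold=3.0))
```

With `alpha=1.0` the estimate reduces to the plain mean, so a single function covers both the mean and the tail metrics mentioned in the results.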

Why it matters

Provides a practical, data-driven framework for deploying robust, adaptive locomotion policies on real robots without privileged environment information.

Abstract

In this work, we introduce a risk-aware reinforcement learning framework for robust quadrupedal locomotion. Our approach first trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained optimization technique, which improves both training stability and sample efficiency. During deployment, we frame online policy selection as a multi-armed bandit problem. Relying solely on observed episodic returns rather than privileged environment information, this method dynamically adjusts the robot’s robustness level to handle unknown conditions on the fly. We evaluate our approach in simulation across eight diverse settings (varying dynamics, contacts, sensing noise, and terrain) as well as in real-world trials on a Unitree Go2 robot. Compared to existing baselines, our risk-aware policy achieves nearly twice the mean and tail performance in novel environments, with the bandit algorithm successfully identifying the optimal policy within just two minutes of operation.

Index terms

Legged Robots; Reinforcement Learning; Robust/Adaptive Control