ICRA 2026

Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

Humphrey James Lee Munn, Brendan Tidd, Peter Bohm, Marcus Gallagher, David Howard

AI summary

GCR-PPO resolves gradient conflicts in multi-objective reinforcement learning, enabling scalable and stable training of complex robotic policies without significant computational overhead.

Multi-objective RL, Gradient Conflict Resolution, PPO, Robot Learning, Scalable Control, PCGrad

Problem

Combining multiple reward objectives into a single scalar in robotic reinforcement learning masks gradient conflicts, causing brittle policies, poor convergence, and limited scalability as task complexity grows.

Approach

The method decomposes each actor update into objective-wise gradients using a multi-headed critic, then applies priority-based gradient projection to resolve conflicts between those gradients during policy updates.
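As a rough illustration of priority-based projection, the sketch below follows the PCGrad-style rule of removing from a lower-priority gradient the component that opposes a higher-priority one. It is a minimal NumPy sketch, not the authors' implementation: the function name `resolve_conflicts`, the flat-vector gradient representation, the use of list position as priority, and the final summation are all assumptions made for illustration.

```python
import numpy as np


def resolve_conflicts(grads, eps=1e-12):
    """Combine per-objective gradients, highest priority first (hypothetical helper).

    grads: list of 1-D arrays, ordered from highest to lowest priority.
    A lower-priority gradient that conflicts with a higher-priority one
    (negative dot product) has the opposing component projected out.
    """
    grads = [np.asarray(g, dtype=float) for g in grads]
    resolved = []
    for i, g in enumerate(grads):
        g = g.copy()
        # Assumption: project against the original higher-priority gradients.
        for h in grads[:i]:
            dot = g @ h
            if dot < 0.0:
                # Remove the part of g that points against h.
                g -= dot / (h @ h + eps) * h
        resolved.append(g)
    # Assumption: the final update direction is the plain sum.
    return np.sum(resolved, axis=0)


g_task = np.array([1.0, 0.0])   # higher priority
g_reg = np.array([-1.0, 1.0])   # lower priority, conflicts with g_task
update = resolve_conflicts([g_task, g_reg])
```

Here the regulariser gradient loses its component along the negative task direction, so the combined update is approximately `[1, 1]` rather than the plain sum `[0, 1]`, which would make no progress on the task objective.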

Key results

+9.5% average performance gain (Symmetric Percentage Change) over massively parallel PPO
Superior scaling on high-conflict tasks (Spearman ρ = 0.736)
Enables complex stylized behaviors standard PPO misses
Minimal computational overhead with direct PPO integration

Why it matters

Allows roboticists to train complex, multi-objective policies more reliably, with less manual reward tuning and no significant computational overhead.

Abstract

Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust real-world robot locomotion, many tasks still require careful reward tuning and remain brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we study gradient conflicts that arise when multiple task objectives are combined into a scalar reward. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a lightweight modification to PPO that decomposes actor updates into objective-wise gradients using a multi-headed critic and resolves conflicts according to objective priority. We evaluate GCR-PPO on IsaacLab manipulation and locomotion benchmarks and two additional tasks modified to include many objectives. GCR-PPO demonstrates superior scalability compared to massively-parallel PPO (p = 0.04) without significant computational overhead. Across tasks, GCR-PPO improves performance over large-scale PPO by an average of 9.5% (Symmetric Percentage Change), with larger gains on tasks exhibiting higher gradient conflict. Code is available at: https://github.com/humphreymunn/GCR-PPO.
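One way to picture the objective-wise decomposition: if the critic outputs one value estimate per reward component, generalised advantage estimation (GAE) can be run separately for each component, and each component advantage can then drive its own policy gradient. The NumPy sketch below works under those assumptions and is not taken from the paper's code. The function name `per_objective_gae`, the array layout, and the `gamma` and `lam` defaults are hypothetical, and episode terminations are ignored for brevity.

```python
import numpy as np


def per_objective_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE computed independently per reward component (hypothetical helper).

    rewards: array of shape (T, K), one column per objective
             (a 1-D array of length T also works).
    values:  array of shape (T + 1, K), critic-head estimates, with the
             last row used as the bootstrap value.
    Returns advantages with the same shape as rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # One-step TD residual for every timestep and objective.
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = np.zeros(deltas.shape[1:])
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


rng = np.random.default_rng(0)
rewards = rng.normal(size=(5, 3))   # 5 steps, 3 illustrative objectives
values = rng.normal(size=(6, 3))    # matching value-head outputs
adv_per_objective = per_objective_gae(rewards, values)
adv_scalar = per_objective_gae(rewards.sum(axis=1), values.sum(axis=1))
```

Because GAE is linear in the rewards and values, the per-objective advantages sum to the advantage obtained from the summed reward and summed value heads. The decomposition therefore exposes which objectives pull in different directions without changing the unnormalised scalar signal.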

Index terms

Reinforcement Learning; Machine Learning for Robot Control; Whole-Body Motion Planning and Control