Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Humphrey James Lee Munn, Brendan Tidd, Peter Bohm, Marcus Gallagher, David Howard
AI summary
Problem
Combining multiple reward objectives into a single scalar in robotic reinforcement learning masks gradient conflicts, causing brittle policies, poor convergence, and limited scalability as task complexity grows.
Approach
The method decomposes the actor update into per-objective gradients via a multi-headed critic, then applies priority-based gradient projection to resolve conflicts between those gradients during policy updates.
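As a rough illustration of what a multi-headed critic provides, the sketch below computes one advantage per reward component instead of a single scalar advantage. It assumes a one-step TD estimator and a critic with one value head per objective; the function name, the objective names, and the choice of estimator are hypothetical and not taken from the paper.

```python
from typing import Dict


def per_objective_advantages(
    rewards: Dict[str, float],
    values: Dict[str, float],
    next_values: Dict[str, float],
    gamma: float = 0.99,
) -> Dict[str, float]:
    """One-step TD advantage per objective: A_k = r_k + gamma * V_k(s') - V_k(s).

    Assumption: each key is one reward component with its own critic head.
    """
    return {k: rewards[k] + gamma * next_values[k] - values[k] for k in rewards}


# Hypothetical two-objective transition: a task reward and a smoothness penalty.
advantages = per_objective_advantages(
    rewards={"task": 1.0, "smoothness": -0.5},
    values={"task": 2.0, "smoothness": -1.0},
    next_values={"task": 4.0, "smoothness": -2.0},
    gamma=0.5,
)
print(advantages)
```

Because this estimator is linear in the rewards and values, the per-objective advantages sum to the advantage of the summed scalar reward. Keeping them separate is what allows each objective to contribute its own policy gradient.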
Key results
- +9.5% average performance gain over massively parallel PPO
- Superior scaling on high-conflict tasks (Spearman ρ = 0.736)
- Enables complex stylized behaviors standard PPO misses
- Minimal computational overhead with direct PPO integration
Why it matters
GCR-PPO lets roboticists train complex, multi-objective policies more reliably, with less dependence on manual reward tuning and without significant computational overhead.
Abstract
Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust real-world robot locomotion, many tasks still require careful reward tuning and remain brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we study gradient conflicts that arise when multiple task objectives are combined into a scalar reward. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a lightweight modification to PPO that decomposes actor updates into objective-wise gradients using a multi-headed critic and resolves conflicts according to objective priority. We evaluate GCR-PPO on IsaacLab manipulation and locomotion benchmarks and two additional tasks modified to include many objectives. GCR-PPO demonstrates superior scalability compared to massively-parallel PPO (p = 0.04) without significant computational overhead. Across tasks, GCR-PPO improves performance over large-scale PPO by an average of 9.5% (Symmetric Percentage Change), with larger gains on tasks exhibiting higher gradient conflict. Code is available at: https://github.com/humphreymunn/GCR-PPO.
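To make the idea of resolving conflicts according to objective priority concrete, here is a minimal sketch of one possible projection rule. It assumes a PCGrad-style projection in which a lower-priority gradient loses only the component that opposes a higher-priority gradient. The exact rule used by GCR-PPO is not stated in the abstract, so the function name, the ordering convention, and the projection itself are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def resolve_by_priority(grads: Sequence[Sequence[float]]) -> Vector:
    """Combine per-objective gradients, ordered from highest to lowest priority.

    Assumed rule: when a gradient has a negative dot product with an already
    resolved higher-priority gradient, subtract its projection onto that
    gradient. Non-conflicting gradients pass through unchanged.
    """
    resolved: List[Vector] = []
    for g in grads:
        g = list(g)
        for h in resolved:
            overlap = dot(g, h)
            norm_sq = dot(h, h)
            if overlap < 0.0 and norm_sq > 0.0:
                scale = overlap / norm_sq
                g = [gi - scale * hi for gi, hi in zip(g, h)]
        resolved.append(g)
    # The update direction is the sum of the resolved gradients.
    return [sum(column) for column in zip(*resolved)]


# Hypothetical example: a task gradient (high priority) and a conflicting
# regularisation gradient (low priority).
g_task = [1.0, 0.0]
g_style = [-1.0, 1.0]
print(resolve_by_priority([g_task, g_style]))  # [1.0, 1.0]
```

In this example the plain sum of the two gradients would be `[0.0, 1.0]`, which cancels all progress on the task objective. After projection the lower-priority gradient becomes `[0.0, 1.0]`, so the combined update keeps the full task direction and still follows the non-conflicting part of the regularisation term.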