ICRA 2026

Better Than Diverse Demonstrators: Reward Decomposition from Suboptimal and Heterogeneous Demonstrations

Chunyue Xue, Letian Chen, Matthew Gombolay

AI summary

REPRESENT disentangles shared task rewards from strategy-specific preferences to learn policies that outperform diverse, suboptimal human demonstrators.

Inverse Reinforcement Learning, Learning from Demonstration, Reward Decomposition, Suboptimal Data, Heterogeneous Strategies, Robotic Policy Learning

Problem

Real-world learning from demonstration relies on non-expert data that is simultaneously suboptimal and heterogeneous, causing traditional inverse reinforcement learning methods to fail or produce ambiguous policies.

Approach

The method jointly models a shared task reward and individual strategy rewards using a custom loss function and noise-performance curve fitting, effectively separating common objectives from demonstrator-specific quirks.
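The noise-performance curve fitting mentioned above can be illustrated with a small sketch. The summary does not say which curve family or fitting procedure the paper uses, so everything here is an assumption for illustration: the sigmoid shape, the grid search over `slope` and `midpoint`, and all function and variable names are hypothetical. The idea shown is only that performance observed at several injected-noise levels can be summarized by a smooth decreasing curve.

```python
import math

def decay(noise, slope, midpoint):
    """Sigmoid that falls from about 1 to about 0 as noise passes the midpoint."""
    return 1.0 / (1.0 + math.exp(slope * (noise - midpoint)))

def noise_performance(noise, offset, scale, slope, midpoint):
    """Assumed curve: performance = offset + scale * decay(noise)."""
    return offset + scale * decay(noise, slope, midpoint)

def fit_noise_performance(noise_levels, returns, slopes, midpoints):
    """Grid-search the nonlinear parameters (slope, midpoint); for each pair,
    solve offset and scale in closed form by ordinary least squares."""
    n = len(noise_levels)
    mean_r = sum(returns) / n
    best = None
    for slope in slopes:
        for midpoint in midpoints:
            g = [decay(x, slope, midpoint) for x in noise_levels]
            mean_g = sum(g) / n
            var_g = sum((gi - mean_g) ** 2 for gi in g)
            if var_g == 0.0:
                continue  # degenerate curve, cannot estimate scale
            cov = sum((gi - mean_g) * (ri - mean_r) for gi, ri in zip(g, returns))
            scale = cov / var_g
            offset = mean_r - scale * mean_g
            err = sum((offset + scale * gi - ri) ** 2 for gi, ri in zip(g, returns))
            if best is None or err < best[0]:
                best = (err, offset, scale, slope, midpoint)
    if best is None:
        raise ValueError("no usable (slope, midpoint) pair in the grid")
    return best[1:]  # (offset, scale, slope, midpoint)

# Synthetic data generated from a known curve, purely for demonstration.
noise = [i / 10 for i in range(11)]
observed = [noise_performance(x, 10.0, 90.0, 8.0, 0.5) for x in noise]
fit = fit_noise_performance(noise, observed,
                            slopes=[2.0, 4.0, 8.0, 16.0],
                            midpoints=[0.25, 0.5, 0.75])
```

A gradient-based or library optimizer would be the more usual choice in practice; the grid search is used here only to keep the sketch dependency-free.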

Key results

Disentangles shared task rewards from strategy-specific components
Achieves up to 300% policy performance improvement over baselines
Demonstrates higher correlation with true task rewards across three robotic domains
Enables agents to surpass non-expert demonstrator performance

Why it matters

Provides a scalable solution for real-world robot learning from imperfect, diverse human teaching in domains like healthcare and manufacturing.

Abstract

Inverse Reinforcement Learning (IRL) typically involves inferring a reward function from expert demonstrations to enable agents to imitate the demonstrated behavior. However, real-world settings often provide suboptimal and heterogeneous demonstrations, where human demonstrators use diverse strategies and imperfect actions. Yet we are unaware of any prior work that addresses IRL from demonstrations that are simultaneously heterogeneous and suboptimal. In this work, we propose a novel approach, REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), that disentangles the latent intrinsic task reward and the strategy-specific reward from suboptimal and diverse strategies. Our method learns to identify a shared task reward component that generalizes across varying demonstrator preferences while also modeling distinct strategy-specific rewards. By decomposing the common task reward across varied demonstrations, REPRESENT extracts the core objectives shared by all strategies, enabling the agent to perform better than the demonstrators while preserving individual strategy preferences. We validate our approach on three robotic domains, showing a higher correlation with the true task reward and improved policy performance compared to baselines. These results suggest that REPRESENT can effectively handle suboptimality and heterogeneity, providing a solution for real-world Learning from Demonstration (LfD) applications to better learn from demonstrations that vary in quality and strategy.
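As a minimal sketch of what a shared task reward plus distinct strategy-specific rewards could look like, the toy example below assumes rewards that are linear in hand-picked features and that combine additively. Both assumptions, along with the feature names, weights, and trajectories, are hypothetical and are not taken from the paper; the sketch only illustrates why a demonstrator's own combined reward can rank trajectories differently from the shared task reward.

```python
def dot(weights, features):
    """Inner product of a weight vector and a feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def trajectory_return(weights, trajectory):
    """Sum of per-step linear rewards along a trajectory."""
    return sum(dot(weights, step) for step in trajectory)

def combine(task_weights, strategy_weights):
    """Assumed additive composition: demonstrator reward = task + strategy."""
    return [t + s for t, s in zip(task_weights, strategy_weights)]

# Per-step features: (goal_progress, spin). Hypothetical values.
task_w = [1.0, 0.0]          # shared by every demonstrator
strategy_w = {
    "topspin": [0.0, 0.5],   # this demonstrator likes positive spin
    "backspin": [0.0, -0.5], # this one likes negative spin
}

traj_a = [(1.0, 1.0), (1.0, 1.0)]    # more progress, positive spin
traj_b = [(1.0, -1.0), (0.5, -1.0)]  # less progress, negative spin

task_a = trajectory_return(task_w, traj_a)  # 2.0
task_b = trajectory_return(task_w, traj_b)  # 1.5

back_w = combine(task_w, strategy_w["backspin"])
back_a = trajectory_return(back_w, traj_a)  # 1.0
back_b = trajectory_return(back_w, traj_b)  # 2.5
```

Under these toy numbers the backspin demonstrator's combined reward prefers `traj_b`, while the shared task reward prefers `traj_a`. Separating the two components is what would let a learner optimize the task itself without inheriting one demonstrator's stylistic preference.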

Index terms

Reinforcement Learning, Learning from Demonstration