DAPPER: Discriminability-Aware Policy-to-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition
Yuki Kadokawa, Jonas Frey, Takahiro Miki, Takamitsu Matsubara, Marco Hutter
AI summary
Problem
Preference-based RL suffers from low query efficiency because relying on a single policy creates behavioral bias, making trajectory pairs too similar for humans to reliably distinguish and label.
Approach
The framework trains new policies from scratch after each reward update to diversify behavior, uses a learned discriminator to estimate how easily a human can judge each trajectory pair, and prioritizes the most discriminable pairs as queries. Policy training jointly maximizes the preference reward and the discriminability score.
Key results
- Higher query efficiency than previous methods in simulated and real-world legged robot environments
- Successful policy learning with fewer human queries than prior methods
- Consistent learning under challenging low-discriminability conditions
- Validated on both simulation and physical quadruped robot platforms
Why it matters
Provides a practical, query-efficient pathway for non-experts to customize legged robot behaviors through minimal human feedback.
Abstract
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy. While human responses to these queries make it possible to learn policies aligned with human preferences, PbRL suffers from low query efficiency, as policy bias limits trajectory diversity and reduces the number of discriminable queries available for learning preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond comparisons within a single policy and instead generate queries by comparing trajectories from multiple policies, as training them from scratch promotes diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and preference discriminability score, encouraging the discovery of highly rewarding and easily distinguishable policies. Experiments in simulated and real-world legged robot environments demonstrate that DAPPER outperforms previous methods in query efficiency, particularly under challenging preference discriminability conditions.
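The query-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes that each trajectory is summarized by a single scalar feature, that `toy_discriminability` stands in for the learned discriminator, and that prioritization is done by softmax-weighted sampling without replacement. The abstract specifies none of these details.

```python
import itertools
import math
import random


def policy_to_policy_pairs(features_by_policy):
    """Enumerate candidate queries whose two trajectories come from different policies."""
    pairs = []
    for (p_a, trajs_a), (p_b, trajs_b) in itertools.combinations(
        sorted(features_by_policy.items()), 2
    ):
        for i, j in itertools.product(range(len(trajs_a)), range(len(trajs_b))):
            pairs.append(((p_a, i), (p_b, j)))
    return pairs


def toy_discriminability(feat_a, feat_b):
    """Hypothetical stand-in for the learned discriminator.

    Assumption: pairs that differ more in the scalar feature are easier to judge.
    The returned score lies in [0, 1).
    """
    return 1.0 - math.exp(-abs(feat_a - feat_b))


def sample_queries(pairs, scores, num_queries, temperature=1.0, rng=None):
    """Sample queries without replacement, favoring high discriminability scores.

    Softmax weighting is an assumption; a lower temperature is closer to greedy top-k.
    """
    rng = rng or random.Random()
    remaining = list(range(len(pairs)))
    chosen = []
    for _ in range(min(num_queries, len(pairs))):
        max_score = max(scores[k] for k in remaining)  # subtract max for stability
        weights = [math.exp((scores[k] - max_score) / temperature) for k in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        chosen.append(pairs[pick])
    return chosen


# Toy data: three policies, each trajectory reduced to one scalar feature.
features = {"pi_0": [0.0, 0.1], "pi_1": [0.9], "pi_2": [0.15]}
pairs = policy_to_policy_pairs(features)
scores = [
    toy_discriminability(features[p_a][i], features[p_b][j])
    for (p_a, i), (p_b, j) in pairs
]
queries = sample_queries(pairs, scores, num_queries=2, temperature=1e-3,
                         rng=random.Random(0))
print(queries)
```

With a very low temperature the sampler behaves almost greedily, so the pair with the largest feature gap is selected first. A higher temperature spreads queries over less discriminable pairs as well.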