ICRA 2026

DAPPER: Discriminability-Aware Policy-To-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition

Yuki Kadokawa, Jonas Frey, Takahiro Miki, Takamitsu Matsubara, Marco Hutter

AI summary

DAPPER significantly improves query efficiency in preference-based RL by prioritizing human-discriminable trajectory comparisons across multiple diverse policies.

Preference-based RL; Query Efficiency; Policy Discriminability; Legged Robots; Human-in-the-Loop; Reinforcement Learning

Problem

Preference-based RL suffers from low query efficiency because relying on a single policy creates behavioral bias, making trajectory pairs too similar for humans to reliably distinguish and label.

Approach

The framework trains multiple policies from scratch to maximize behavioral diversity, employs a learned discriminator to estimate human discriminability of trajectory pairs, and prioritizes sampling queries that are easiest for humans to judge while jointly optimizing for preference reward.
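
The prioritized query sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_queries`, the `temperature` parameter, the softmax weighting, and the `discriminability` callable (a stand-in for the learned discriminator) are all assumptions made for illustration. The sketch only shows the general idea of forming trajectory pairs across different policies and favoring pairs predicted to be easy for a human to judge.

```python
import itertools

import numpy as np


def sample_queries(trajs_by_policy, discriminability, num_queries,
                   temperature=1.0, rng=None):
    """Sample cross-policy trajectory pairs, favoring discriminable ones.

    trajs_by_policy: list with one entry per policy, each a list of trajectories.
    discriminability: hypothetical callable (traj_a, traj_b) -> float score,
        standing in for a learned discriminator (assumption, not from the paper).
    temperature: softmax temperature; lower values prioritize more sharply.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Candidate queries pair trajectories that come from two different policies.
    candidates = []
    for trajs_i, trajs_j in itertools.combinations(trajs_by_policy, 2):
        candidates.extend(itertools.product(trajs_i, trajs_j))

    scores = np.array([discriminability(a, b) for a, b in candidates], dtype=float)

    # Numerically stable softmax over the predicted discriminability scores.
    logits = scores / temperature
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()

    # Draw distinct pairs, with higher-scoring pairs more likely to be chosen.
    k = min(num_queries, len(candidates))
    chosen = rng.choice(len(candidates), size=k, replace=False, p=probs)
    return [candidates[t] for t in chosen]
```

In this sketch, a call such as `sample_queries(trajs, disc, num_queries=10)` returns up to ten pairs to show to a human labeler. Sampling from a softmax rather than always taking the top-scoring pairs is a design choice of the sketch that keeps some variety in the queries; the paper may prioritize differently.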

Key results

Superior query efficiency in simulated and real-world legged robot environments
Successful policy learning with fewer human queries than prior methods
Consistent learning under challenging low-discriminability conditions
Validated on both simulation and physical quadruped robot platforms

Why it matters

Provides a practical, query-efficient pathway for non-experts to customize legged robot behaviors through minimal human feedback.

Abstract

Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy. While human responses to these queries make it possible to learn policies aligned with human preferences, PbRL suffers from low query efficiency, as policy bias limits trajectory diversity and reduces the number of discriminable queries available for learning preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond comparisons within a single policy and instead generate queries by comparing trajectories from multiple policies, as training them from scratch promotes diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and preference discriminability score, encouraging the discovery of highly rewarding and easily distinguishable policies. Experiments in simulated and real-world legged robot environments demonstrate that DAPPER outperforms previous methods in query efficiency, particularly under challenging preference discriminability conditions.

Index terms

Reinforcement Learning; Human Factors and Human-in-the-Loop; Legged Robots