← Back ICRA 2024

POLITE: Preferences Combined with Highlights in Reinforcement Learning

Simon Holk, Daniel Marta, Iolanda Leite

PDF

Abstract

Many solutions to address the challenge of robot learning have been devised, namely through exploring novel ways for humans to communicate complex goals and tasks in reinforcement learning (RL) setups. One way that experienced recent research interest directly addresses the problem by considering human feedback as preferences between pairs of trajectories (sequences of state-action pairs). However, when simply attributing a single preference to a pair of trajectories that contain many agglomerated steps, key pieces of information are lost in the process. We amplify the initial definition of preferences to account for highlights: state-action pairs of relatively high information (high/low reward) within a preferred trajectory. To include the additional information, we design novel regularization methods within a preference learning framework. To this extent, we present our method which is able to greatly reduce the necessary amount of preferences, by permitting the highlighting of favoured trajectories, in order to reduce the entropy of the credit assignment. We show the effectiveness of our work in both simulation and a user study, which analyzes the feedback given and its implications. We also use the total collected feedback to train a robot policy for socially compliant trajectories in a simulated social navigation environment. We release code and video examples at https://sites.google.com/view/rl-polite

Index terms

Human Factors and Human-in-the-Loop Reinforcement Learning Representation Learning