MOSAIC: Multi-Objective Optimization from Zero-Shot Language Reasoning in Preference-Based RL
Daniel Marta, Simon Holk, Iolanda Leite
AI summary
Problem
Existing preference-based RL methods typically collapse human feedback into a single reward function, ignoring the multi-dimensional nature of human preferences and the causal reasoning behind them, which leads to objective collapse and causal confusion.
Approach
MOSAIC uses zero-shot large language models to parse natural language prompts accompanying human preferences, extracting distinct objectives and their relative weights to train an ensemble of reward functions optimized via multi-objective reinforcement learning.
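To make the idea of per-objective weights concrete, here is a minimal sketch of how weights could combine per-objective segment returns into one score, which then feeds a Bradley-Terry preference model (a common choice in preference-based RL). The summary does not specify MOSAIC's formulation, so everything here is an assumption for illustration: the function names, the objective names `speed` and `safety`, the linear scalarization, and the use of Bradley-Terry.

```python
import math

def normalize_weights(raw_weights):
    """Rescale non-negative objective weights so they sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in raw_weights.items()}

def scalarized_return(objective_returns, weights):
    """Weighted sum of per-objective segment returns (linear scalarization)."""
    return sum(weights[name] * objective_returns[name] for name in weights)

def preference_probability(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))

# Hypothetical weights, e.g. as an LLM might extract them from a prompt
# such as "mostly care about speed, but stay safe".
weights = normalize_weights({"speed": 3.0, "safety": 1.0})

# Hypothetical per-objective returns predicted for two trajectory segments.
segment_a = {"speed": 2.0, "safety": -2.0}
segment_b = {"speed": 0.0, "safety": 1.0}

score_a = scalarized_return(segment_a, weights)
score_b = scalarized_return(segment_b, weights)
p_a_preferred = preference_probability(score_a, score_b)
```

In this sketch a larger weight on an objective makes differences in that objective dominate the predicted preference. A multi-objective RL method may keep the objectives separate rather than scalarizing them this way.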
Key results
- Introduces MOSAIC framework for multi-objective preference-based RL
- Proposes weighted ensemble variance query sampling for informative feedback selection
- Develops sentiment-based reward regularization to highlight critical trajectory segments
- Demonstrates superior performance over single-preference baselines across simulated and real human feedback tasks
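The summary names weighted ensemble variance query sampling without describing it. One plausible reading, sketched below, scores each candidate query by the disagreement (variance) among ensemble members within each objective, weights that disagreement by the objective's importance, and asks the human about the highest-scoring queries. The data layout, function names, and the choice of population variance are assumptions and are not taken from the paper.

```python
from statistics import pvariance

def weighted_ensemble_variance(predictions, weights):
    """Sum over objectives of weight * variance of ensemble predictions.

    predictions maps an objective name to the list of preference
    probabilities predicted by that objective's ensemble members.
    """
    return sum(weights[obj] * pvariance(probs) for obj, probs in predictions.items())

def select_queries(candidates, weights, k):
    """Return indices of the k candidates with the highest weighted variance."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: weighted_ensemble_variance(candidates[i], weights),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical objective weights and ensemble predictions for three queries.
weights = {"speed": 0.75, "safety": 0.25}
candidates = [
    {"speed": [0.5, 0.5], "safety": [0.5, 0.5]},  # members agree everywhere
    {"speed": [0.1, 0.9], "safety": [0.5, 0.5]},  # disagreement on speed
    {"speed": [0.5, 0.5], "safety": [0.1, 0.9]},  # disagreement on safety
]

chosen = select_queries(candidates, weights, k=2)
```

Under these assumptions, disagreement on a heavily weighted objective is ranked above equal disagreement on a lightly weighted one, so feedback is requested where it matters most for the stated preferences.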
Why it matters
Enables robots to accurately learn complex, multi-dimensional goals from natural language feedback, advancing practical human-robot alignment and preference-based control.
Abstract
Preference-based Reinforcement Learning (RL) enables humans to shape complex goals via preference comparisons between sequences of state-action pairs. Most of the existing approaches focus on a singular objective, overlooking the complex causal reasoning that underpins preferences. However, many real-world challenges are multi-dimensional, and individuals can have different reasons behind their preferences. In this work, we rethink preference-based RL from a multi-objective perspective by distilling human preferences into multiple components. We leverage the zero-shot capabilities of large language models (LLMs) to infer preferences and better align various objectives from text prompts. This allows us to train an ensemble of reward functions, each optimizing for a specific objective. We demonstrate that our approach can address a variety of multi-objective control tasks, improving on approaches that consider a single preference per objective. We show the effectiveness of our approach in better shaping reward functions by utilizing real human preferences and prompts. Our code for the benchmarks, along with additional supplementary details, is available at https://sites.google.com/view/multi-pref/.