PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents
Filippo Ziliotto, Jelin Raphael Akkara, Alessandro Daniele, Lamberto Ballan, Luciano Serafini, Tommaso Campari
AI summary
Problem
Embodied AI agents struggle to interpret and act on user-specific preferences and object ownership in realistic environments, as existing benchmarks lack rigorous personalization and rely on static or image-based cues.
Approach
The authors introduce PersONAL, a benchmark with over 2,000 episodes across 30+ photorealistic homes, where agents must use textual scene descriptions and ownership metadata to navigate to or ground user-specific objects.
Key results
- Released a dataset of 2,000+ high-quality episodes across 30+ HM3D homes with three difficulty levels
- Defined two evaluation modes: active navigation in unseen environments and object grounding in mapped scenes
- Demonstrated a substantial performance gap between state-of-the-art zero-shot baselines and human-level performance
- Improved caption quality and lexical diversity over prior benchmarks like GOAT-Bench
Why it matters
It provides a crucial evaluation framework for developing real-world assistive robots that can understand and act on personalized human preferences in domestic settings.
Abstract
Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as "find Lily’s backpack". PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning about, and remembering personalized information, and paving the way towards real-world assistive robots. Code and dataset are available at: github.io/PersONAL
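To make the episode structure described in the abstract concrete, here is a minimal Python sketch of how an episode with ownership associations might be represented and queried. It is an illustration under stated assumptions, not the benchmark's actual schema: the class names `OwnedObject` and `Episode`, their fields, the scene identifier, and the substring-based `resolve` method are all hypothetical, and the released dataset may organize this information differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OwnedObject:
    """Hypothetical record linking an object to its owner."""
    owner: str
    category: str
    position: Tuple[float, float, float]  # assumed (x, y, z) in scene coordinates


@dataclass
class Episode:
    """Hypothetical episode: a scene, its textual description, and owned objects."""
    scene_id: str
    description: str
    objects: List[OwnedObject] = field(default_factory=list)

    def resolve(self, query: str) -> List[OwnedObject]:
        """Return objects whose owner and category both appear in the query.

        Plain substring matching stands in for the language understanding
        a real agent would need; it is only meant to show the data flow.
        """
        q = query.lower()
        return [
            o for o in self.objects
            if o.owner.lower() in q and o.category.lower() in q
        ]


# Illustrative data only; names, scene id, and coordinates are made up.
episode = Episode(
    scene_id="example_scene",
    description="Lily's backpack is on the bedroom chair. "
                "Tom's backpack is next to the sofa.",
    objects=[
        OwnedObject("Lily", "backpack", (1.0, 0.0, 2.5)),
        OwnedObject("Tom", "backpack", (4.2, 0.0, 0.8)),
    ],
)

targets = episode.resolve("find Lily's backpack")
print([(t.owner, t.category) for t in targets])
```

The point of the sketch is that two objects of the same category are distinguished only by their owner, so an agent cannot succeed by category detection alone. This is the kind of user-specific reasoning the benchmark is described as requiring.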