ICRA 2026

Assistant Placement Aria: A Benchmark for Egocentric Placement Assistance

Amir Belder, Gonçalo Dias Pais, Refael Vivanyi, Daniel DeTone, Omri Carmi, Ido, Binyamin Gattegno, Oren Shrout, Ayellet Tal

AI summary

Current foundation models struggle with human-centric placement tasks, highlighting the need for a new benchmark that captures diverse spatial and preference constraints.

Virtual Placement, Egocentric Vision, Human-Centric AI, Object Placement, Benchmark, Robotics Assistance

Problem

Traditional placement tasks focus on single predefined targets, ignoring the complex human-centric reasoning required to identify all plausible locations in a scene. Existing datasets lack scale, diversity, and explicit modeling of human preferences and physical constraints.

Approach

The authors created a benchmark with 2D and 3D annotations plus text descriptions for three placement tasks across synthetic and real egocentric scenes, using manual tagging and VLM-guided automatic annotation to capture human preferences.

Key results

Introduced the first Virtual Placement benchmark covering diverse objects with 2D/3D masks and text descriptions
Developed a scalable VLM-based automatic tagging method that captures human preferences (mean IoU 0.60–0.70)
Established baselines showing state-of-the-art detection and segmentation models struggle on VP tasks (IoU ≤0.46)
Provided over 500,000 annotations across 250 real and synthetic egocentric scenes
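
The IoU figures above compare a predicted placement region against an annotated one. This page does not describe the paper's exact evaluation protocol, so the following is only a minimal sketch of the standard mask IoU, assuming binary masks stored as rows of 0/1 values; the helper names `mask_iou` and `mean_iou` and the empty-mask convention are illustrative assumptions, not taken from the benchmark code.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 values (assumed mask layout)


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, annotation) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("at least one mask pair is required")
    return sum(scores) / len(scores)


# Toy example: 2 overlapping pixels out of 4 covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 0.5
```

Under this definition, a mean IoU of 0.60 to 0.70 means that, on average, well over half of the combined predicted and annotated area is shared between the two masks, while a score at or below 0.46 means less than half is shared.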

Why it matters

It provides a critical data resource for developing AI systems that understand human preferences and physical constraints for assistive robotics and spatial reasoning.

Abstract

Human assistance in robotics spans several tasks such as navigation, object manipulation, and placement, where a key challenge is selecting target destinations that align with human intentions or preferences. We focus on this challenge in the context of Virtual Placement (VP), the task of identifying all plausible target locations given scene context and human-centric constraints. This differs from traditional placement tasks that typically focus on a single, predefined target location. The VP problem is complex, as it requires both global and local reasoning about the scene’s geometry, semantics, and plausibility. To address this gap, we introduce Assistant Placement Aria, the first benchmark to explore diverse aspects of VP, including global, local, and human-centric constraints. It contains both synthetic and real indoor scenes annotated for three tasks: (i) 2D Panel Placement, (ii) Sitting Suggestion, and (iii) TV Placement. Each scene includes 2D images, a 3D point cloud, and a textual description of the objects within the scene. By contributing this benchmark, we aim to encourage further research in this underexplored and challenging field that is critically dependent on relevant data. We also evaluate several foundation models for object detection and segmentation on our benchmark. The benchmark is available at: https://github.com/amirbelder/-Placement-Aria— Benchmark-for-Egocentric-Placement-Assistance.

Index terms

Semantic Scene Understanding; Data Sets for Robot Learning; Computer Vision for Automation