ICRA 2026

Where I Am & Where to Go: Egocentric Indoor Scene Perception with Agent Interaction for Remote Embodied Visual Grounding

Hongtao Zhang, Yili Tang, Yuan Gao, Jue Zhang, Jidong Zhang, Mingbo Zhao

AI summary

Integrating two lightweight auxiliary agents for current and target room prediction significantly boosts navigation success and remote object grounding in embodied AI.

Vision-and-Language Navigation, Embodied AI, Visual Grounding, Room Type Recognition, Auxiliary Agents, REVERIE Benchmark

Problem

Current vision-and-language navigation agents struggle with high-level, abstract instructions and unseen environments, often failing to stop at the correct room or accurately locate remote target objects.

Approach

The authors introduce a plug-and-play framework featuring two auxiliary agents that predict the current and target room types from visual and textual inputs, respectively, to guide the main navigation model.

Key results

Improves navigation success rate by 7.78% on the REVERIE benchmark
Increases remote grounding success by 5.48% over baseline models
Achieves competitive results in unseen test environments without additional training data
Provides a model-agnostic, plug-and-play interaction framework for existing VLN agents

Why it matters

It advances practical embodied AI by enabling robots to reliably execute high-level, abstract instructions in unfamiliar indoor environments.

Abstract

Embodied Referring Expression Grounding (REVERIE) is a Vision-and-Language Navigation (VLN) task that better reflects real-world human instructions. Unlike conventional VLN, REVERIE is more challenging as agents must navigate in unseen environments and ground remote objects described by short, high-level commands. This requires agents not only to plan a route without detailed step-by-step guidance but also to accurately localize the target object at the destination. Existing VLN agents mainly emphasize navigation performance while overlooking object grounding success, leading to a significant performance gap. We introduce a model-agnostic interaction framework with two auxiliary agents, Where-I-Am (WIA) and Where-to-Go (W2G). Specifically, WIA predicts the current room type from environmental observations, while W2G infers the target room type from high-level instructions. Our framework is plug-and-play and can be integrated with various VLN models. On the REVERIE benchmark, it improves navigation success rate (SR) by 7.78% and remote grounding success (RGS) by 5.48% over the baselines, demonstrating the effectiveness and generality of our design. Furthermore, in challenging unseen test environments, our framework achieves competitive results on the REVERIE dataset, outperforming the previous state-of-the-art VLN agent (without additional training data) with a 2.27% gain in RGS.
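The interaction between the two auxiliary agents can be illustrated with a minimal sketch. The abstract does not specify how the WIA and W2G predictions are fused with the base VLN model, so everything below is an assumption for illustration only: the room-type label set, the function names (`room_match`, `rescore_stop`), and the log-bonus fusion rule are all hypothetical and are not the authors' method. The sketch treats WIA's output as a probability distribution over room types for the current view, treats W2G's output as a distribution over the target room type, and uses their agreement to adjust a base agent's stop score.

```python
import math

# Hypothetical room-type label set; the paper's actual labels are not given here.
ROOM_TYPES = ["bedroom", "bathroom", "kitchen", "living room"]


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def room_match(current_room_probs, target_room_probs):
    """Probability that the current and target room types coincide,
    assuming the two predictions are independent."""
    return sum(c * t for c, t in zip(current_room_probs, target_room_probs))


def rescore_stop(stop_logit, current_room_probs, target_room_probs, weight=1.0):
    """Hypothetical fusion rule: add a log-agreement bonus to the base
    agent's stop logit, so stopping is penalized in mismatched rooms."""
    match = room_match(current_room_probs, target_room_probs)
    return stop_logit + weight * math.log(match + 1e-8)


# Assumed W2G output for an instruction whose target is a bathroom.
target = [0.05, 0.85, 0.05, 0.05]
# Assumed WIA outputs for two different views.
view_in_bathroom = [0.10, 0.70, 0.10, 0.10]
view_in_bedroom = [0.70, 0.10, 0.10, 0.10]

score_bathroom = rescore_stop(0.0, view_in_bathroom, target)
score_bedroom = rescore_stop(0.0, view_in_bedroom, target)
```

Under these assumed inputs the stop score is higher when the agent believes it is in the target room type than when it is not, which is one way room-level predictions could help a navigation model decide where to stop before grounding the object.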

Index terms

Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning