ICRA 2026

Where I Am & Where to Go: Egocentric Indoor Scene Perception with Agent Interaction for Remote Embodied Visual Grounding

Hongtao Zhang, Yili Tang, Yuan Gao, Jue Zhang, Jidong Zhang, Mingbo Zhao

AI summary

Integrating two lightweight auxiliary agents for current and target room prediction significantly boosts navigation success and remote object grounding in embodied AI.
Vision-and-Language Navigation, Embodied AI, Visual Grounding, Room Type Recognition, Auxiliary Agents, REVERIE Benchmark

Problem

Current vision-and-language navigation agents struggle with high-level, abstract instructions and unseen environments, often failing to stop at the correct room or accurately locate remote target objects.

Approach

The authors introduce a plug-and-play framework featuring two auxiliary agents that predict the current and target room types from visual and textual inputs, respectively, to guide the main navigation model.
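As a rough illustration of the idea, the sketch below shows one way two room-type predictions could be combined into a signal for a navigation policy. The summary does not say how the authors fuse the auxiliary outputs with the main model, so everything here is an assumption made for illustration: the room label set, the function names `room_match` and `biased_stop_logit`, the dot-product agreement score, and the additive bias on a "stop" logit.

```python
from math import exp

# Hypothetical room-type label set; the paper's actual label set is not given here.
ROOM_TYPES = ["bathroom", "bedroom", "kitchen", "living room"]


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def room_match(current_probs, target_probs):
    """Probability that the current and target room types coincide,
    treating the two predictions as independent (an assumption)."""
    return sum(c * t for c, t in zip(current_probs, target_probs))


def biased_stop_logit(stop_logit, current_probs, target_probs, weight=1.0):
    """Raise the stop logit when room agreement exceeds chance, lower it otherwise.
    The additive form and the `weight` parameter are illustrative choices."""
    chance = 1.0 / len(current_probs)
    return stop_logit + weight * (room_match(current_probs, target_probs) - chance)


# Stand-ins for the two auxiliary outputs: a WIA-style prediction from the
# current view and a W2G-style prediction from the instruction.
in_kitchen = softmax([0.0, 0.0, 3.0, 0.0])
in_bathroom = softmax([3.0, 0.0, 0.0, 0.0])
wants_kitchen = softmax([0.0, 0.0, 3.0, 0.0])

stop_when_matching = biased_stop_logit(0.0, in_kitchen, wants_kitchen)
stop_when_mismatched = biased_stop_logit(0.0, in_bathroom, wants_kitchen)
```

In this toy setup the stop logit rises when the predicted current room agrees with the predicted target room and falls when it does not, which is one plausible reading of how room-level cues could help an agent stop in the right place.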

Key results

  • Improves navigation success rate by 7.78% on the REVERIE benchmark
  • Increases remote grounding success by 5.48% over baseline models
  • On the unseen test environments, outperforms the previous state-of-the-art VLN agent (without additional training data) by 2.27% in remote grounding success
  • Provides a model-agnostic, plug-and-play interaction framework for existing VLN agents

Why it matters

It advances practical embodied AI by enabling robots to reliably execute high-level, abstract instructions in unfamiliar indoor environments.

Abstract

Embodied Referring Expression Grounding (REVERIE) is a Vision-and-Language Navigation (VLN) task that better reflects real-world human instructions. Unlike conventional VLN, REVERIE is more challenging as agents must navigate in unseen environments and ground remote objects described by short, high-level commands. This requires agents not only to plan a route without detailed step-by-step guidance but also to accurately localize the target object at the destination. Existing VLN agents mainly emphasize navigation performance while overlooking object grounding success, leading to a significant performance gap. We introduce a model-agnostic interaction framework with two auxiliary agents, Where-I-Am (WIA) and Where-to-Go (W2G). Specifically, WIA predicts the current room type from environmental observations, while W2G infers the target room type from high-level instructions. Our framework is plug-and-play and can be integrated with various VLN models. On the REVERIE benchmark, it improves navigation success rate (SR) by 7.78% and remote grounding success (RGS) by 5.48% over the baselines, demonstrating the effectiveness and generality of our design. Furthermore, in challenging unseen test environments, our framework achieves competitive results on the REVERIE dataset, outperforming the previous state-of-the-art VLN agent (without additional training data) with a 2.27% gain in RGS.

Index terms

Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning
