Where I Am & Where to Go: Egocentric Indoor Scene Perception with Agent Interaction for Remote Embodied Visual Grounding
Hongtao Zhang, Yili Tang, Yuan Gao, Jue Zhang, Jidong Zhang, Mingbo Zhao
AI summary
Problem
Current vision-and-language navigation agents struggle with high-level, abstract instructions and unseen environments, often failing to stop at the correct room or accurately locate remote target objects.
Approach
The authors introduce a plug-and-play framework featuring two auxiliary agents that predict the current and target room types from visual and textual inputs, respectively, to guide the main navigation model.
Key results
- Improves navigation success rate (SR) by 7.78% over the baselines on the REVERIE benchmark
- Increases remote grounding success by 5.48% over baseline models
- In unseen test environments, outperforms the previous state-of-the-art VLN agent (without additional training data) by 2.27% in RGS
- Provides a model-agnostic, plug-and-play interaction framework for existing VLN agents
Why it matters
It advances practical embodied AI by enabling robots to reliably execute high-level, abstract instructions in unfamiliar indoor environments.
Abstract
Embodied Referring Expression Grounding (REVERIE) is a Vision-and-Language Navigation (VLN) task that better reflects real-world human instructions. REVERIE is more challenging than conventional VLN because agents must navigate in unseen environments and ground remote objects described by short, high-level commands. This requires agents not only to plan a route without detailed step-by-step guidance but also to accurately localize the target object at the destination. Existing VLN agents mainly emphasize navigation performance while overlooking object grounding success, which leaves a significant gap between navigation and grounding performance. We introduce a model-agnostic interaction framework with two auxiliary agents, Where-I-Am (WIA) and Where-to-Go (W2G). Specifically, WIA predicts the current room type from environmental observations, while W2G infers the target room type from high-level instructions. Our framework is plug-and-play and can be integrated with various VLN models. On the REVERIE benchmark, it improves navigation success rate (SR) by 7.78% and remote grounding success (RGS) by 5.48% over the baselines, demonstrating the effectiveness and generality of our design. Furthermore, in challenging unseen test environments, our framework achieves competitive results on the REVERIE dataset, outperforming the previous state-of-the-art VLN agent (without additional training data) with a 2.27% gain in RGS.
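The abstract does not say how the WIA and W2G predictions are passed to the navigation model. As a minimal sketch of one possible interaction, the Python below assumes the two room-type predictions are used to gate the base agent's stop decision. Every name here (`ROOM_TYPES`, `where_to_go`, `where_i_am`, `gated_stop`) is hypothetical, and the keyword-matching and argmax stand-ins are toy placeholders for the learned agents, not the authors' implementation.

```python
from typing import Dict, Optional

# Hypothetical room-type vocabulary; the source does not list the label set.
ROOM_TYPES = ("bathroom", "bedroom", "kitchen", "living room")


def where_to_go(instruction: str) -> Optional[str]:
    """Toy stand-in for W2G: infer the target room type from the instruction.

    Assumption: a simple keyword match replaces the learned text model.
    Returns None when no known room type is mentioned.
    """
    text = instruction.lower()
    for room in ROOM_TYPES:
        if room in text:
            return room
    return None


def where_i_am(room_scores: Dict[str, float]) -> str:
    """Toy stand-in for WIA: pick the highest-scoring room type.

    Assumption: `room_scores` plays the role of a visual classifier's output
    for the current observation.
    """
    return max(room_scores, key=room_scores.get)


def gated_stop(base_wants_stop: bool, current_room: str,
               target_room: Optional[str]) -> bool:
    """Allow the base agent to stop only when the room types agree.

    If W2G gives no target room, defer entirely to the base agent.
    """
    if target_room is None:
        return base_wants_stop
    return base_wants_stop and current_room == target_room


target = where_to_go("Go to the bathroom and clean the sink")
current = where_i_am({"kitchen": 0.7, "bathroom": 0.2})
print(gated_stop(True, current, target))  # stop is vetoed: rooms differ
```

Because the gate only consumes the base agent's stop decision and two room labels, it can wrap any navigator without changing its internals, which is one way to read the "plug-and-play" claim.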