Boosting Zero-Shot VLN Via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-And-VisitInfo-Aware Prompting
Boqi Li, Siyuan Li, Weiyi Wang, Anran Li, Zhong Cao, Henry X. Liu
AI summary
Problem
Existing zero-shot VLN methods struggle to predict feasible waypoints, track spatial structure, and retain exploration history, so agents get stuck or lose context in continuous environments.
Approach
The framework predicts linearly reachable waypoints from a simplified 2D obstacle map and feeds a dynamically updated topological graph with visitation records into MLLM prompts to guide exploration and correct navigation errors.
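To make "linearly reachable" concrete, here is a minimal sketch, not the authors' implementation. It assumes the obstacle map is a 2D grid where 1 marks an obstacle and 0 marks free space, samples one candidate per heading at a fixed radius, and keeps a candidate only if the straight segment from the agent to it crosses no obstacle cell. The function names, the grid encoding, and the fixed-radius sampling are all illustrative assumptions.

```python
import math


def is_linearly_reachable(grid, start, goal):
    """Return True if the straight segment start -> goal crosses only free cells.

    Assumed encoding: grid[r][c] == 1 is an obstacle, 0 is free space.
    """
    (r0, c0), (r1, c1) = start, goal
    steps = max(abs(r1 - r0), abs(c1 - c0))
    for i in range(steps + 1):
        t = i / steps if steps else 0.0
        # Sample the segment once per grid step and snap to the nearest cell.
        r = round(r0 + t * (r1 - r0))
        c = round(c0 + t * (c1 - c0))
        if grid[r][c] == 1:
            return False
    return True


def candidate_waypoints(grid, agent, radius, num_headings=8):
    """Sample one candidate per heading at a fixed radius; keep reachable ones."""
    rows, cols = len(grid), len(grid[0])
    waypoints = []
    for k in range(num_headings):
        angle = 2 * math.pi * k / num_headings
        r = agent[0] + round(radius * math.sin(angle))
        c = agent[1] + round(radius * math.cos(angle))
        inside = 0 <= r < rows and 0 <= c < cols
        if inside and is_linearly_reachable(grid, agent, (r, c)):
            waypoints.append((r, c))
    return waypoints
```

With an agent at the centre of a 5x5 grid and a single obstacle directly to one side, the candidate behind the obstacle is dropped while the other three headings survive.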
Key results
- 41% success rate on R2R-CE and 36% on RxR-CE
- Outperforms prior state-of-the-art zero-shot VLN methods
- Gradient-based obstacle map construction improves waypoint feasibility
- Enables MLLM-driven local path planning and re-exploration
Why it matters
Offers a scalable, training-free navigation strategy for embodied AI, reducing reliance on complex sensory inputs and extensive task-specific data.
Abstract
With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip the MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods. The source code is available at: https://github.com/michigan-traffic-lab/OMAP-VLN.
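As a rough illustration of encoding a topological graph and visitation records into a prompt, the sketch below keeps an undirected graph of waypoint nodes with per-node visit counts and serializes both to plain text. The `TopoGraph` class, the node naming, and the text layout are hypothetical choices made for this example; the paper's actual prompt format is not given in the abstract.

```python
class TopoGraph:
    """Undirected waypoint graph with per-node visit counts (illustrative only)."""

    def __init__(self):
        self.edges = {}   # node id -> set of neighbouring node ids
        self.visits = {}  # node id -> number of times the agent stood there

    def add_edge(self, a, b):
        """Connect two nodes, creating them if they do not exist yet."""
        for node in (a, b):
            self.edges.setdefault(node, set())
            self.visits.setdefault(node, 0)
        self.edges[a].add(b)
        self.edges[b].add(a)

    def visit(self, node):
        """Record that the agent has reached `node`."""
        self.edges.setdefault(node, set())
        self.visits[node] = self.visits.get(node, 0) + 1

    def to_prompt(self, current):
        """Serialize connectivity and visit history as text for an MLLM prompt."""
        lines = [f"Current node: {current}"]
        for node in sorted(self.edges):
            count = self.visits[node]
            status = f"visited {count}x" if count else "unvisited"
            neighbours = ", ".join(sorted(self.edges[node]))
            lines.append(f"- {node} ({status}) -> {neighbours}")
        return "\n".join(lines)


graph = TopoGraph()
graph.add_edge("n0", "n1")
graph.add_edge("n0", "n2")
for node in ("n0", "n1", "n0"):  # agent walks n0 -> n1 -> back to n0
    graph.visit(node)
print(graph.to_prompt("n0"))
```

Listing unvisited nodes explicitly is one simple way to let the model prefer unexplored directions or plan a path back through known nodes after a wrong turn.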