ICRA 2026

Boosting Zero-Shot VLN Via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-And-VisitInfo-Aware Prompting

Boqi Li, Siyuan Li, Weiyi Wang, Anran Li, Zhong Cao, Henry Liu

AI summary

A lightweight obstacle-map waypoint predictor combined with topology-aware prompting achieves state-of-the-art zero-shot vision-language navigation in continuous environments.

Zero-shot navigation, Vision-language navigation, Waypoint prediction, Multimodal LLMs, Topological graphs, Continuous environments

Problem

Existing zero-shot VLN methods struggle with waypoint feasibility, spatial tracking, and exploration history, causing agents to get stuck or lose context in continuous environments.

Approach

The framework predicts linearly reachable waypoints from a simplified 2D obstacle map and feeds a dynamically updated topological graph with visitation records into MLLM prompts to guide exploration and correct navigation errors.
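A minimal sketch of what "linearly reachable" could mean on a 2D obstacle grid: a waypoint is kept only if the straight segment from the agent to it crosses no obstacle cell. The grid encoding (1 = obstacle, 0 = free), the function names, and the circular sampling of candidates are illustrative assumptions, not the authors' implementation.

```python
import math


def is_linearly_reachable(grid, start, goal):
    """Return True if the straight segment start -> goal hits no obstacle cell.

    grid: list of rows, 1 = obstacle, 0 = free (assumed encoding).
    start, goal: (row, col) integer cell coordinates.
    """
    (r0, c0), (r1, c1) = start, goal
    steps = max(abs(r1 - r0), abs(c1 - c0))
    for i in range(steps + 1):
        t = i / steps if steps else 0.0
        # Sample one cell per step along the dominant axis.
        r = round(r0 + t * (r1 - r0))
        c = round(c0 + t * (c1 - c0))
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
            return False  # leaving the map counts as unreachable
        if grid[r][c]:
            return False
    return True


def candidate_waypoints(grid, agent, radius, num_angles=12):
    """Sample cells on a circle around the agent and keep the reachable ones."""
    waypoints = []
    for k in range(num_angles):
        theta = 2 * math.pi * k / num_angles
        goal = (round(agent[0] + radius * math.sin(theta)),
                round(agent[1] + radius * math.cos(theta)))
        if goal != agent and is_linearly_reachable(grid, agent, goal):
            waypoints.append(goal)
    return waypoints
```

Calling candidate_waypoints(grid, agent, radius) returns only the sampled cells with a clear straight line from the agent, so every candidate handed to the MLLM can be reached by a single straight move.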

Key results

41% success rate on R2R-CE and 36% on RxR-CE
Outperforms prior state-of-the-art zero-shot VLN methods
Gradient-based obstacle map construction improves waypoint feasibility
Enables MLLM-driven local path planning and re-exploration
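
The summary does not say how the gradient-based obstacle map is built. One plausible reading, sketched below purely as an assumption, is to mark a cell as an obstacle when the height change to a neighboring cell is too steep. The height-map input, the cell_size and max_slope values, and the function name are all hypothetical.

```python
def obstacle_map_from_heights(height, cell_size=0.1, max_slope=0.5):
    """Mark cells whose height difference to a 4-neighbor is too steep.

    height: list of rows of ground heights in meters (hypothetical input).
    cell_size: grid resolution in meters per cell (assumed value).
    max_slope: rise-over-run threshold for an obstacle (assumed value).
    """
    rows, cols = len(height), len(height[0])
    obstacle = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Finite-difference slope between adjacent cells.
                    slope = abs(height[nr][nc] - height[r][c]) / cell_size
                    if slope > max_slope:
                        obstacle[r][c] = True
    return obstacle
```

With this rule, flat floor stays free while both sides of a step edge, such as the base of a wall, become obstacle cells.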

Why it matters

Offers a scalable, training-free navigation strategy for embodied AI, reducing reliance on complex sensory inputs and extensive task-specific data.

Abstract

With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history; this encourages exploration and equips the MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods. The source code is available at: https://github.com/michigan-traffic-lab/OMAP-VLN.
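
To illustrate how a topological graph with visitation records might be encoded into a prompt, here is a minimal sketch. The TopoGraph class, its method names, and the text layout of the prompt are assumptions made for illustration; the actual prompt format is defined in the authors' repository.

```python
from dataclasses import dataclass, field


@dataclass
class TopoGraph:
    """Hypothetical topological graph with per-node visit counts."""
    edges: dict = field(default_factory=dict)   # node id -> set of neighbor ids
    visits: dict = field(default_factory=dict)  # node id -> number of visits

    def add_edge(self, a, b):
        """Connect two waypoints, creating them as unvisited if new."""
        for node in (a, b):
            self.edges.setdefault(node, set())
            self.visits.setdefault(node, 0)
        self.edges[a].add(b)
        self.edges[b].add(a)

    def visit(self, node):
        """Record that the agent has reached this node once more."""
        self.edges.setdefault(node, set())
        self.visits[node] = self.visits.get(node, 0) + 1

    def to_prompt(self, current):
        """Serialize structure and visit history as plain text for an MLLM."""
        lines = [f"You are at node {current}."]
        for node in sorted(self.edges):
            neighbors = ", ".join(str(n) for n in sorted(self.edges[node]))
            count = self.visits[node]
            status = f"visited {count}x" if count else "unvisited"
            lines.append(f"Node {node} ({status}) connects to: {neighbors}")
        return "\n".join(lines)
```

Listing unvisited nodes explicitly gives the model a textual cue to prefer unexplored waypoints, and the adjacency lines let it name a multi-hop route back to an earlier node when it decides it has gone wrong.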

Index terms

Vision-Based Navigation, Semantic Scene Understanding