ICRA 2026

DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-And-Language Navigation

Yunheng Wang, Yuetong Fang, Taowen Wang, Yixiao Feng, Yawen Tan, Shuning Zhang, Peiran Liu, Yiding Ji, Renjing Xu

AI summary

DreamNav achieves state-of-the-art zero-shot navigation by unifying low-cost egocentric perception, trajectory-level planning, and active imagination without task-specific training.

Zero-shot VLN, Trajectory Planning, Egocentric Perception, Active Imagination, Foundation Models, Embodied AI

Problem

Existing zero-shot vision-and-language navigation methods rely on costly panoramic perception, make short-sighted point-level decisions, and lack long-horizon planning, making them expensive and semantically misaligned.

Approach

DreamNav stabilizes low-cost egocentric inputs with an EgoView Corrector, plans global routes via a Trajectory Predictor, and proactively forecasts future scenarios using an Imagination Predictor, all guided by foundation models without fine-tuning.

Key results

Alleviates high-cost perception by operating solely on low-cost egocentric RGB-D inputs
Mitigates short-sightedness through an Imagination Predictor enabling long-range proactive reasoning
Resolves semantic misalignment with a Trajectory Predictor generating globally coherent navigation paths
Sets a new zero-shot state-of-the-art on VLN-CE and real-world tests, outperforming the strongest egocentric baseline with extra information by up to 7.49% in success rate (SR) and 18.15% in success weighted by path length (SPL)
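
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how SR and SPL are commonly computed in the VLN literature. It is not taken from the paper's code. The Episode class, its field names, and the four example episodes are hypothetical and chosen only for illustration; the numbers are not results from DreamNav. SPL follows the standard definition: each successful episode contributes the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, and failed episodes contribute zero.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical container, not from the paper)."""
    success: bool         # agent stopped within the success radius of the goal
    shortest_path: float  # shortest-path length from start to goal, in metres (> 0)
    agent_path: float     # length of the path the agent actually travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by (normalised inverse) path length."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success:
            # A detour lowers the score; a path shorter than the reference
            # cannot push the per-episode score above 1.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative episodes only; these are made-up values.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # 2x detour
    Episode(success=False, shortest_path=8.0, agent_path=5.0),    # stopped early
    Episode(success=False, shortest_path=12.0, agent_path=30.0),  # got lost
]

print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

Because SPL penalises detours even on successful episodes, a method can improve SPL by more than it improves SR if its successful paths become more direct.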

Why it matters

It provides a practical, cost-effective blueprint for deploying reliable, long-horizon embodied agents in real-world continuous environments without expensive sensors or task-specific training.

Abstract

Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on the following three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory and long-horizon planning, we propose an Imagination Predictor to endow the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline with extra information by up to 7.49% and 18.15% in terms of SR and SPL metrics, respectively. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.

Index terms

Vision-Based Navigation, AI-Enabled Robotics, Task and Motion Planning