DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-And-Language Navigation
Yunheng Wang, Yuetong Fang, Taowen Wang, Yixiao Feng, Yawen Tan, Shuning Zhang, Peiran Liu, Yiding Ji, Renjing Xu
AI summary
Problem
Existing zero-shot vision-and-language navigation methods rely on costly panoramic perception, make short-sighted point-level decisions, and lack long-horizon planning, making them expensive and semantically misaligned.
Approach
DreamNav stabilizes low-cost egocentric inputs with an EgoView Corrector, plans global routes via a Trajectory Predictor, and proactively forecasts future scenarios using an Imagination Predictor, all guided by foundation models without fine-tuning.
Key results
- Alleviates high-cost perception by operating solely on low-cost egocentric RGB-D inputs
- Mitigates short-sightedness through an Imagination Predictor enabling long-range proactive reasoning
- Resolves semantic misalignment with a Trajectory Predictor generating globally coherent navigation paths
- Sets a new zero-shot state-of-the-art on VLN-CE and real-world tests, outperforming the strongest egocentric baseline (which uses extra information) by up to 7.49% in SR and 18.15% in SPL
Why it matters
It provides a practical, cost-effective blueprint for deploying reliable, long-horizon embodied agents in real-world continuous environments without expensive sensors or task-specific training.
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, and they collapse control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which targets three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory, long-horizon planning, our Imagination Predictor endows the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline with extra information by up to 7.49% in success rate (SR) and 18.15% in success weighted by path length (SPL). To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.
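The abstract reports gains in SR and SPL without defining them. As a reading aid, the sketch below computes both metrics using the definitions commonly used in the navigation literature: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the path length the agent actually traveled. The `Episode` class and its field names are hypothetical and chosen for illustration; they are not taken from the DreamNav paper or its evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical structure, for illustration only)."""
    success: bool                 # did the agent stop within the success radius?
    shortest_path_length: float   # geodesic distance from start to goal
    agent_path_length: float      # distance the agent actually traveled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i).

    S_i is the success indicator, l_i the shortest-path length, and
    p_i the agent's path length. Failed episodes contribute zero.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.shortest_path_length <= 0:
            raise ValueError("shortest_path_length must be positive")
        if e.success:
            total += e.shortest_path_length / max(
                e.agent_path_length, e.shortest_path_length
            )
    return total / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(True, 10.0, 10.0),   # optimal path, contributes 1.0
        Episode(True, 10.0, 20.0),   # twice as long as needed, contributes 0.5
        Episode(False, 10.0, 5.0),   # failure, contributes 0.0
    ]
    print(f"SR  = {success_rate(demo):.3f}")
    print(f"SPL = {spl(demo):.3f}")
```

Because SPL penalizes detours even on successful episodes, it can never exceed SR on the same set of episodes, which is why the two numbers are usually reported together.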