VISTA: Generative Visual Imagination for Vision-And-Language Navigation
Yanjia Huang, Mingyang Wu, Renjie Li, Zhengzhong Tu
AI summary
Problem
Existing Vision-and-Language Navigation agents struggle in long-horizon scenarios due to reliance on immediate observations, inability to anticipate unseen environments, and gaps between vision and language modalities.
Approach
VISTA uses a closed-loop framework that dynamically generates goal visualizations via diffusion models, aligns them with real-time observations through a perceptual filter, and reasons over actions using Chain-of-Thought prompting.
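The closed loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `navigate`, `ToyCorridor`, `imagine`, `choose_action`, and `stop_threshold` are hypothetical. Cosine similarity over toy feature vectors stands in for the Perceptual Alignment Filter, a fixed one-hot vector stands in for the diffusion-generated goal image, and a constant policy stands in for Chain-of-Thought reasoning.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def navigate(
    instruction: str,
    observe: Callable[[], Vector],
    imagine: Callable[[Vector, str], Vector],
    choose_action: Callable[[Vector, Vector, float, str], str],
    step: Callable[[str], None],
    max_steps: int = 20,
    stop_threshold: float = 0.95,  # assumed value, not taken from the paper
) -> List[str]:
    """Hypothetical imagine, align, act loop."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        obs = observe()                       # current observation features
        goal = imagine(obs, instruction)      # stand-in for the imagined goal view
        score = cosine_similarity(obs, goal)  # stand-in for perceptual alignment
        if score >= stop_threshold:
            trajectory.append("stop")
            break
        action = choose_action(obs, goal, score, instruction)
        step(action)
        trajectory.append(action)
    return trajectory


class ToyCorridor:
    """A one-dimensional corridor whose observation is a one-hot position vector."""

    def __init__(self, length: int = 5, goal_cell: int = 3) -> None:
        self.length = length
        self.goal_cell = goal_cell
        self.position = 0

    def _one_hot(self, index: int) -> List[float]:
        return [1.0 if i == index else 0.0 for i in range(self.length)]

    def observe(self) -> List[float]:
        return self._one_hot(self.position)

    def imagine(self, obs: Vector, instruction: str) -> List[float]:
        # Placeholder for a generative model: always "imagines" the goal cell.
        return self._one_hot(self.goal_cell)

    def choose_action(self, obs: Vector, goal: Vector, score: float, instruction: str) -> str:
        # Placeholder for structured reasoning: always move forward.
        return "forward"

    def step(self, action: str) -> None:
        if action == "forward":
            self.position = min(self.position + 1, self.length - 1)


world = ToyCorridor()
trajectory = navigate(
    "walk to the fourth cell",
    world.observe, world.imagine, world.choose_action, world.step,
)
```

In this toy setting the agent moves forward until its observation matches the imagined goal and then stops. A real system would replace each stub with a learned component, and the stopping rule here is only one possible choice.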
Key results
- State-of-the-art performance on R2R and RoboTHOR benchmarks
- +3.6% increase in Success Rate on R2R
- Adaptive Imagination Scheduler for dynamic goal prediction
- Perceptual Alignment Filter for interpretable visual grounding
Why it matters
Provides a cognitively inspired, interpretable navigation framework that bridges generative AI and embodied decision-making for robotics and autonomous agents.
Abstract
Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an ‘observe-and-reason’ schema: agents observe the environment and decide on the next action based on visual observations of their surroundings. These approaches often struggle in long-horizon scenarios because they are limited to immediate observations and suffer from vision-language modality gaps. To overcome this, we present VISTA, a novel framework that employs an ‘imagine-and-align’ navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current observations, guiding an interpretable and structured reasoning process for action selection. Experiments show that VISTA sets new state-of-the-art results on the Room-to-Room (R2R) and RoboTHOR benchmarks, e.g., +3.6% increase in Success Rate on R2R. Extensive ablation analysis underscores the value of integrating forward-looking imagination, perceptual alignment, and structured reasoning for robust navigation in long-horizon environments.
Key Words: Vision-and-Language Navigation, Diffusion Models, Vision Language Models