ICRA 2026

VISTA: Generative Visual Imagination for Vision-And-Language Navigation

Yanjia Huang, Mingyang Wu, Renjie Li, Zhengzhong Tu

AI summary

Integrating dynamic visual imagination with perceptual alignment and structured reasoning significantly improves navigation success and interpretability in long-horizon VLN tasks.

Vision-and-Language Navigation, Visual Imagination, Diffusion Models, Perceptual Alignment, Chain-of-Thought Reasoning, Embodied AI

Problem

Existing Vision-and-Language Navigation agents struggle in long-horizon scenarios due to reliance on immediate observations, inability to anticipate unseen environments, and gaps between vision and language modalities.

Approach

VISTA uses a closed-loop framework that dynamically generates goal visualizations via diffusion models, aligns them with real-time observations through a perceptual filter, and reasons over actions using Chain-of-Thought prompting.
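The closed loop described above (imagine, align, reason, act, repeat) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy `WORLD` map, the two-number "view" features that stand in for images, `imagine_goal` as a placeholder for the diffusion model, cosine similarity as a placeholder for the perceptual filter, and a greedy pick with a one-line rationale as a placeholder for Chain-of-Thought prompting.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical toy world: each place has a 2-D "view" feature and named exits.
WORLD: Dict[str, dict] = {
    "hall": {"view": [0.3, 0.7], "exits": {"left": "study", "right": "corridor"}},
    "study": {"view": [0.1, 0.9], "exits": {"back": "hall"}},
    "corridor": {"view": [0.7, 0.3], "exits": {"forward": "kitchen", "back": "hall"}},
    "kitchen": {"view": [1.0, 0.0], "exits": {"back": "corridor"}},
}


@dataclass
class Decision:
    place: str
    action: str
    alignment: float
    rationale: str


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as a stand-in for perceptual alignment."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def imagine_goal(instruction: str, current_view: Vector) -> Vector:
    """Stand-in for the diffusion model: returns a feature for the imagined goal.

    current_view is accepted to mirror conditioning on local observations;
    this toy rule ignores it.
    """
    return [1.0, 0.0] if "kitchen" in instruction.lower() else [0.0, 1.0]


def navigate(
    instruction: str,
    start: str,
    stop_threshold: float = 0.95,
    max_steps: int = 10,
) -> Tuple[str, List[Decision]]:
    """Run the imagine -> align -> reason -> act loop until a stop decision."""
    place, trace = start, []
    for _ in range(max_steps):
        view = WORLD[place]["view"]
        goal = imagine_goal(instruction, view)  # re-imagined at every step
        here = cosine(goal, view)
        if here >= stop_threshold:
            trace.append(Decision(place, "stop", here,
                                  "current view matches the imagined goal"))
            break
        exits = WORLD[place]["exits"]
        # Align the imagined goal against the view behind each candidate action.
        scores = {a: cosine(goal, WORLD[nxt]["view"]) for a, nxt in exits.items()}
        action = max(scores, key=scores.get)
        trace.append(Decision(place, action, scores[action],
                              f"'{action}' leads to the view closest to the imagined goal"))
        place = exits[action]
    return place, trace
```

Calling `navigate("walk to the kitchen", "hall")` on this toy map moves right, then forward, then stops in the kitchen. The returned trace keeps an alignment score and a rationale for every step, which is the kind of record that makes a decision process inspectable.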

Key results

State-of-the-art performance on R2R and RoboTHOR benchmarks
3.6% increase in Success Rate on R2R
Adaptive Imagination Scheduler for dynamic goal prediction
Perceptual Alignment Filter for interpretable visual grounding
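For readers unfamiliar with the Success Rate metric cited above, the sketch below shows one common way to compute it. It assumes the usual R2R convention that an episode succeeds when the agent stops less than 3 m from the goal, a detail that comes from the benchmark and not from this page. It also uses straight-line distance as a simplification of the shortest-path distance on the navigation graph. The function name and the sample coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def success_rate(
    final_positions: Sequence[Point],
    goal_positions: Sequence[Point],
    threshold_m: float = 3.0,  # assumed R2R success radius
) -> float:
    """Fraction of episodes whose final position is within threshold_m of the goal."""
    if len(final_positions) != len(goal_positions):
        raise ValueError("each episode needs one final position and one goal")
    if not final_positions:
        raise ValueError("at least one episode is required")
    successes = sum(
        math.dist(final, goal) < threshold_m  # Euclidean stand-in for path distance
        for final, goal in zip(final_positions, goal_positions)
    )
    return successes / len(final_positions)


# Made-up episodes: distances to goal are 1.0, 5.0, 0.0, and 3.5 metres.
finals = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 1.0, 0.0), (10.0, 10.0, 0.0)]
goals = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (10.0, 13.5, 0.0)]
print(success_rate(finals, goals))  # 0.5
```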

Why it matters

Provides a cognitively inspired, interpretable navigation framework that bridges generative AI and embodied decision-making for robotics and autonomous agents.

Abstract

Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an ‘observe-and-reason’ schema: agents observe the environment and decide on the next action based on the visual observations of their surroundings. They often face challenges in long-horizon scenarios due to limitations in immediate observation and vision-language modality gaps. To overcome this, we present VISTA, a novel framework that employs an ‘imagine-and-align’ navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current observations, guiding an interpretable and structured reasoning process for action selection. Experiments show that VISTA sets new state-of-the-art results on the Room-to-Room (R2R) and RoboTHOR benchmarks, e.g., a 3.6% increase in Success Rate on R2R. Extensive ablation analysis underscores the value of integrating forward-looking imagination, perceptual alignment, and structured reasoning for robust navigation in long-horizon environments.

Key Words: Vision-and-Language Navigation, Diffusion Models, Vision Language Models

Index terms

Motion and Path Planning, Deep Learning Methods, Vision-Based Navigation