ICRA 2026

Awaken Memories with Words: Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation

Bolei Chen, Jiaxu Kang, Yifei Wang, Ping Zhong, Jianxin Wang

AI summary

A compact, imagination-driven scene representation paired with instruction-aware alignment significantly improves vision-language navigation accuracy and efficiency.

Vision Language Navigation, Implicit Scene Representation, Visual Imagination, Linguistic Grounding, Embodied AI, Neural Grids

Problem

Current VLN agents rely on overly detailed scene representations and coarse vision-language alignment. Both obscure the high-level scene priors that navigation depends on and lead to actions that violate the linguistic command.

Approach

The method models historical observations as fixed-size neural grids and uses recursive visual imagination to extract semantic layouts, while adaptively aligning decoupled instruction components with situational memories.
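To make the idea of a fixed-size memory concrete, here is a minimal sketch in plain Python. It is not the authors' model: the key vectors, the softmax write rule, and the `rate` parameter are all assumptions chosen for illustration. It only shows how a trajectory of any length can be folded, one observation at a time, into a grid whose size never changes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_grid(grid, observation, keys, rate=0.5):
    """Fold one observation into the grid without changing its size.

    Hypothetical write rule: each cell has a key vector, and the
    observation is blended into every cell in proportion to the
    softmax similarity between the observation and that cell's key.
    """
    weights = softmax([dot(key, observation) for key in keys])
    return [
        [(1 - rate * w) * c + rate * w * o for c, o in zip(cell, observation)]
        for cell, w in zip(grid, weights)
    ]


def summarize(observations, keys, dim):
    """Recursively summarize a trajectory into len(keys) cells of size dim."""
    grid = [[0.0] * dim for _ in keys]
    for obs in observations:
        grid = update_grid(grid, obs, keys)
    return grid


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    short = summarize([[1.0, 0.0]] * 3, keys, dim=2)
    long = summarize([[1.0, 0.0]] * 30, keys, dim=2)
    print(len(short), len(long))  # the grid size does not depend on trajectory length
```

The memory cost stays constant however long the agent walks, which is the property the summary attributes to the neural grids. The imagination objectives that shape what the grid stores are not reproduced here.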

Key results

State-of-the-art performance on VLN-CE and ObjectNav benchmarks
Recursive visual imagination effectively filters irrelevant geometric details
Adaptive linguistic grounding enables precise, component-wise vision-language alignment
Ablation studies validate the individual contributions of RVI and ALG modules
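The component-wise alignment listed above can be pictured with a small attention sketch. This is an illustrative assumption, not the paper's ALG module: the component names `landmark` and `action`, the vectors, and the use of scaled dot-product attention are all hypothetical. The point is only that each instruction component queries the memory separately and receives its own context vector.

```python
import math


def attend(query, memory):
    """Scaled dot-product attention of one query over memory cells.

    Returns the attention weights and the weighted sum of the cells.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, cell)) / scale for cell in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    context = [sum(w * cell[i] for w, cell in zip(weights, memory)) for i in range(dim)]
    return weights, context


def ground(components, memory):
    """Align each instruction component with the memory independently."""
    return {name: attend(vec, memory) for name, vec in components.items()}


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0]]  # two memory cells (made-up values)
    components = {
        "landmark": [4.0, 0.0],  # hypothetical embedding of a landmark phrase
        "action": [0.0, 4.0],    # hypothetical embedding of an action phrase
    }
    for name, (weights, context) in ground(components, memory).items():
        print(name, [round(w, 3) for w in weights])
```

Because each component gets its own attention distribution, a landmark phrase and an action phrase can focus on different parts of the memory instead of sharing one coarse alignment.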

Why it matters

Provides a scalable, human-inspired memory mechanism that enhances long-sequence decision-making and instruction following for embodied agents in complex 3D environments.

Abstract

Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.

Index terms

Semantic Scene Understanding, Embodied Cognitive Science, Representation Learning