TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin
AI summary
Problem
Existing large-model-based VLN methods lose crucial visual-spatial information by converting observations to text or lack explicit global memory, limiting their ability to reason over topological structures and backtrack effectively.
Approach
TagaVLM is an end-to-end framework that embeds an online topological map directly into a VLM backbone using an Interleaved Navigation Prompt for node alignment and Spatial Topology Aware Residual Attention to inject edge distances into self-attention layers.
Key results
- State-of-the-art performance among large-model-based methods on the R2R benchmark (SR: 51.09%, SPL: 47.18 in unseen environments)
- Outperforms prior large-model methods by 3.39% SR and 9.08 SPL
- Architecturally injected topological priors in a 0.5B VLM yield competitive results
- Enables effective global action reasoning and path backtracking during navigation
Why it matters
Suggests that, for embodied spatial reasoning, targeted architectural inductive biases on a small VLM can be more effective than brute-force scaling, which may guide the development of efficient, robust autonomous navigation systems.
Abstract
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code can be found on our project page: https://apex-bjut.github.io/Taga-VLM/.
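To make the idea of injecting edge information into self-attention concrete, the following is a minimal NumPy sketch of distance-biased attention over map nodes. It is not the paper's STAR-Att implementation: the function name `distance_biased_attention`, the scalar weight `alpha`, and the linear bias form `-alpha * dist` are all assumptions chosen for illustration. The abstract states only that topological edge information is integrated into the self-attention mechanism in a residual manner.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_biased_attention(q, k, v, dist, alpha=1.0):
    """Single-head attention with an additive distance bias (illustrative only).

    q, k, v : (n, d) arrays, one row per topological-map node.
    dist    : (n, n) array of pairwise edge distances between nodes.
    alpha   : hypothetical scalar weight on the bias; alpha = 0 recovers
              plain scaled dot-product attention.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # standard attention logits
    bias = -alpha * dist            # assumed form: farther nodes are penalized
    weights = softmax(scores + bias, axis=-1)
    return weights @ v, weights

# Toy example: four nodes on a line, unit spacing between neighbors.
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)

out, w = distance_biased_attention(q, k, v, dist, alpha=1.0)
out_plain, w_plain = distance_biased_attention(q, k, v, dist, alpha=0.0)

# With identical (zero) queries and keys, only the distance bias shapes the weights.
zeros = np.zeros((n, d))
_, w_dist_only = distance_biased_attention(zeros, zeros, v, dist, alpha=1.0)
```

Because the bias is added to the logits rather than replacing them, setting `alpha` to zero leaves the original attention untouched. This is one simple way a residual term could leave pretrained behavior intact at initialization, though whether the paper uses this scheme is not stated in the abstract.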