ICRA 2026

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

AI summary

Explicitly embedding topological structures into a VLM, with no vision-to-text conversion, enables robust global navigation and path correction and outperforms methods built on larger proprietary models.

Vision-Language Navigation, Topology-Aware Reasoning, Large Vision-Language Models, Global Action Planning, Spatial Reasoning, End-to-End Navigation

Problem

Existing large-model-based VLN methods lose crucial visual-spatial information by converting observations to text or lack explicit global memory, limiting their ability to reason over topological structures and backtrack effectively.

Approach

TagaVLM is an end-to-end framework that embeds an online topological map directly into a VLM backbone using an Interleaved Navigation Prompt for node alignment and Spatial Topology Aware Residual Attention to inject edge distances into self-attention layers.
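
As a rough illustration of what injecting edge distances into self-attention can look like, the following Python sketch subtracts a distance penalty from the attention logits before the softmax. The additive form, the scalar `alpha`, and the function names are assumptions made for illustration; this summary does not specify the actual STAR-Att formulation. Setting `alpha` to zero recovers ordinary attention, which is one simple way a residual-style term could leave pretrained behavior untouched at initialization.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(q, k, v, dist, alpha=0.0):
    """Single-head attention with an additive distance bias (illustrative only).

    q, k, v: lists of per-node feature vectors (hypothetical layout, one per map node).
    dist:    dist[i][j] is the distance between map nodes i and j.
    alpha:   hypothetical bias strength; alpha = 0 gives plain attention.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) - alpha * dist[i][j]
            for j, kj in enumerate(k)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights


# Three map nodes on a line, 0 -- 1 -- 2, with unit-length edges.
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# Identical features, so any difference in the weights comes from distance alone.
x = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

_, w_plain = distance_biased_attention(x, x, x, dist, alpha=0.0)
_, w_topo = distance_biased_attention(x, x, x, dist, alpha=1.0)
```

With identical node features, the unbiased weights are uniform, while the biased weights for node 0 decrease as graph distance grows.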

Key results

State-of-the-art performance among large-model-based methods on the R2R benchmark (SR: 51.09%, SPL: 47.18 in unseen environments)
Outperforms prior large-model methods by 3.39% SR and 9.08 SPL
Architecturally injected topological priors in a 0.5B VLM yield competitive results
Enables effective global action reasoning and path backtracking during navigation
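
For readers unfamiliar with the two metrics quoted above, the sketch below computes Success Rate (SR) and SPL (Success weighted by Path Length) using the standard definition from the embodied-navigation evaluation literature. The episode records and field names are made up for illustration and are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. The result is averaged over all episodes.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Hypothetical episodes (path lengths in meters), not data from the paper.
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},   # optimal path
    {"success": True, "shortest": 10.0, "taken": 20.0},   # success after a detour
    {"success": False, "shortest": 10.0, "taken": 5.0},   # stopped early and failed
]

sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Because SPL discounts successful episodes by how much longer the taken path was than the shortest one, an agent that succeeds only after long detours scores well on SR but poorly on SPL.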

Why it matters

Demonstrates that targeted architectural inductive biases can be more effective than brute-force scaling for embodied spatial reasoning, guiding the development of efficient, robust autonomous navigation systems.

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code can be found on our project page: https://apex-bjut.github.io/Taga-VLM/.
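
To make the idea of global action reasoning over a topological map concrete, here is a minimal Python sketch, assuming the map is a weighted graph and each unvisited node already has a preference score. The graph, the scores, and the function names are hypothetical; in TagaVLM the choice is made by the VLM itself, and this sketch only shows why a global action space permits backtracking.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra over a dict-of-dicts graph where graph[u][v] is an edge length."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, length in graph[node].items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + length, nxt, path + [nxt]))
    return float("inf"), []


def global_action(graph, visited, current, scores):
    """Choose the best-scoring unvisited node anywhere in the map and route to it."""
    candidates = [n for n in graph if n not in visited]
    target = max(candidates, key=lambda n: scores[n])
    _, path = shortest_path(graph, current, target)
    return target, path


# Hypothetical online map: the agent walked A -> B -> C and has observed,
# but not visited, D (adjacent to A) and E (adjacent to C).
graph = {
    "A": {"B": 1.0, "D": 1.5},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "E": 2.0},
    "D": {"A": 1.5},
    "E": {"C": 2.0},
}
visited = {"A", "B", "C"}
scores = {"D": 0.9, "E": 0.2}  # stand-in for the model's preference per candidate

target, path = global_action(graph, visited, "C", scores)
```

Here the preferred node D is not adjacent to the current node C, so the selected route walks back through B and A. An agent restricted to local actions could only step to B or E from C.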

Index terms

Vision-Based Navigation, Autonomous Vehicle Navigation, Deep Learning for Visual Perception