SFCo-Nav: Efficient Zero-Shot Visual Language Navigation Via Collaboration of Slow LLM and Fast Attributed Graph Alignment
Chaoran Xiong, Litao Wei, Xinhao Hu, Kehui Ma, Ziyi Xia, Zixin Jiang, Zhen Sun, Ling Pei
AI summary
Problem
Existing zero-shot visual language navigation methods rely on exhaustive per-step VLM-LLM inference, resulting in high latency and computational costs that hinder real-time deployment.
Approach
The framework pairs a slow LLM planner that generates strategic subgoals and imagined object graphs with a fast reactive navigator that executes real-time actions, linked by an asynchronous bridge that triggers the LLM only when navigation confidence drops.
Key results
- Matches or exceeds state-of-the-art success rates on R2R and REVERIE benchmarks
- Cuts total token consumption per trajectory by over 50%
- Runs more than 3.5× faster than prior zero-shot methods
- Validated on a legged robot in a real-world hotel suite
Why it matters
Provides a computationally efficient, real-time capable architecture for deploying zero-shot embodied navigation on physical robots without task-specific training.
Abstract
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM–LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow–fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow–fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first zero-shot VLN system with slow–fast collaboration that supports asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5× faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
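The confidence-gated triggering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ObjectNode`, `ObjectGraph`, `alignment_confidence`, and `planner_trigger_steps`, the similarity rule, the 0.7/0.3 node/edge weighting, and the 0.5 threshold are all assumptions made for illustration. The sketch scores how well a perceived attributed graph matches an imagined one and reports the steps at which a slow planner would be called.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """Hypothetical attributed node: an object label plus descriptive attributes."""
    label: str
    attributes: frozenset = frozenset()


@dataclass
class ObjectGraph:
    """Hypothetical object graph; edges are (label_a, relation, label_b) triples."""
    nodes: list
    edges: set = field(default_factory=set)


def node_similarity(imagined: ObjectNode, perceived: ObjectNode) -> float:
    """Assumed rule: a label match is worth 0.5, attribute overlap the other 0.5."""
    if imagined.label != perceived.label:
        return 0.0
    if not imagined.attributes:
        return 1.0  # nothing further to verify for this node
    overlap = len(imagined.attributes & perceived.attributes)
    return 0.5 + 0.5 * overlap / len(imagined.attributes)


def alignment_confidence(imagined: ObjectGraph, perceived: ObjectGraph) -> float:
    """Score in [0, 1] for how well the perceived graph supports the imagined one."""
    if not imagined.nodes:
        return 0.0
    # Each imagined node is matched to its most similar perceived node.
    node_score = sum(
        max((node_similarity(n, p) for p in perceived.nodes), default=0.0)
        for n in imagined.nodes
    ) / len(imagined.nodes)
    if not imagined.edges:
        return node_score
    edge_score = len(imagined.edges & perceived.edges) / len(imagined.edges)
    return 0.7 * node_score + 0.3 * edge_score  # assumed weighting


def planner_trigger_steps(imagined, observations, threshold=0.5):
    """Return the step indices where confidence falls below the threshold,
    i.e. where the slow planner would be invoked instead of on every step."""
    return [
        step
        for step, perceived in enumerate(observations)
        if alignment_confidence(imagined, perceived) < threshold
    ]
```

Under these assumptions, steps whose observations agree with the imagined graph are handled by the fast navigator alone, and the expensive planner call is reserved for steps where the match degrades. This is the mechanism the abstract credits for the reduction in token consumption.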