ICRA 2026

SFCo-Nav: Efficient Zero-Shot Visual Language Navigation Via Collaboration of Slow LLM and Fast Attributed Graph Alignment

Chaoran Xiong, Litao Wei, Xinhao Hu, Kehui Ma, Ziyi Xia, Zixin Jiang, Zhen Sun, Ling Pei

AI summary

SFCo-Nav cuts zero-shot navigation latency and token costs by over 50% while maintaining state-of-the-art success rates through asynchronous slow-fast cognitive collaboration.

Zero-shot VLN, Slow-Fast Collaboration, LLM Planning, Graph Alignment, Embodied Navigation, Real-time Deployment

Problem

Existing zero-shot visual language navigation methods rely on exhaustive per-step VLM-LLM inference, resulting in high latency and computational costs that hinder real-time deployment.

Approach

The framework pairs a slow LLM planner that generates strategic subgoals and imagined object graphs with a fast reactive navigator that executes real-time actions, linked by an asynchronous bridge that triggers the LLM only when navigation confidence drops.
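The confidence-gated control flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Subgoal`, `overlap_confidence`, `navigate`, `slow_plan`, and `fast_step`, the set-overlap confidence metric, and the 0.5 threshold are all assumptions made for illustration. The loop is also written synchronously for brevity, whereas the paper describes an asynchronous bridge.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Subgoal:
    name: str
    expected: FrozenSet[str]  # toy stand-in for an imagined object graph


def overlap_confidence(expected: FrozenSet[str], seen: FrozenSet[str]) -> float:
    """Fraction of expected objects currently perceived (assumed metric)."""
    if not expected:
        return 1.0
    return len(expected & seen) / len(expected)


def navigate(
    slow_plan: Callable[[Optional[FrozenSet[str]]], List[Subgoal]],
    fast_step: Callable[[Subgoal, int], Tuple[FrozenSet[str], bool]],
    threshold: float = 0.5,  # hypothetical trigger level
    max_steps: int = 20,
) -> Tuple[int, int]:
    """Run a confidence-gated loop and return (fast_steps, slow_calls)."""
    plan = slow_plan(None)  # one slow planner call up front
    slow_calls, steps = 1, 0
    while plan and steps < max_steps:
        # Fast path: act and perceive without calling the LLM.
        seen, reached = fast_step(plan[0], steps)
        steps += 1
        if overlap_confidence(plan[0].expected, seen) < threshold:
            # Confidence dropped: fall back to the slow planner.
            plan = slow_plan(seen)
            slow_calls += 1
        elif reached:
            # Subgoal completed: advance to the next one, still no LLM call.
            plan = plan[1:]
    return steps, slow_calls
```

In this sketch the slow planner runs once at the start and again only on steps where confidence falls below the threshold, so the number of slow calls can stay well below the number of fast steps.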

Key results

Matches or exceeds state-of-the-art success rates on R2R and REVERIE benchmarks
Reduces trajectory token consumption by over 50%
Runs more than 3.5× faster
Validated on a legged robot in a real-world hotel suite

Why it matters

Provides a computationally efficient, real-time capable architecture for deploying zero-shot embodied navigation on physical robots without task-specific training.

Abstract

Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM–LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow–fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow–fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow–fast collaborative zero-shot VLN system to support asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5× faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
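As a rough illustration of how an imagined and a perceived attributed graph might be compared to produce a confidence value, consider the Python sketch below. The abstract does not specify the alignment procedure, so everything here is an assumption: the `ObjectGraph` structure, the Jaccard-based `node_score`, the triple-matching `edge_score`, and the `edge_weight` mixing parameter are hypothetical and should not be read as the method used in SFCo-Nav.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple


@dataclass
class ObjectGraph:
    # Node label -> attribute set, e.g. "sofa" -> {"red", "large"}.
    nodes: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    # Relations stored as (subject, relation, object) triples.
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)


def node_score(imagined: ObjectGraph, perceived: ObjectGraph) -> float:
    """Mean per-node match: 0 if the label is unseen, else attribute Jaccard."""
    if not imagined.nodes:
        return 1.0
    total = 0.0
    for label, attrs in imagined.nodes.items():
        if label not in perceived.nodes:
            continue  # a missing object contributes 0
        seen = perceived.nodes[label]
        union = attrs | seen
        total += len(attrs & seen) / len(union) if union else 1.0
    return total / len(imagined.nodes)


def edge_score(imagined: ObjectGraph, perceived: ObjectGraph) -> float:
    """Fraction of imagined relations that are also perceived."""
    if not imagined.edges:
        return 1.0
    return len(imagined.edges & perceived.edges) / len(imagined.edges)


def alignment_confidence(
    imagined: ObjectGraph, perceived: ObjectGraph, edge_weight: float = 0.5
) -> float:
    """Weighted mix of node and edge agreement (hypothetical weighting)."""
    return (1.0 - edge_weight) * node_score(imagined, perceived) + (
        edge_weight * edge_score(imagined, perceived)
    )
```

A score like this costs only a few set operations per step, which is the kind of lightweight check that could decide whether a call to the slow planner is warranted.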

Index terms

Vision-Based Navigation, Embodied Cognitive Science, Agent-Based Systems