ICRA 2026

StreamVLN: Streaming Vision-And-Language Navigation Via SlowFast Context Modeling

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang

AI summary

StreamVLN enables low-latency, long-horizon vision-and-language navigation by combining a sliding-window KV cache with voxel-based spatial pruning, achieving state-of-the-art benchmark performance and successful real-world robot deployment.
Vision-and-Language Navigation, Streaming Video, Slow-Fast Context, Spatial Pruning, Embodied AI, Real-Time Robotics

Problem

Current Video-LLM-based navigation methods struggle to balance fine-grained visual understanding, long-term context retention, and computational efficiency when processing continuous video streams, resulting in high latency and unbounded memory growth.

Approach

The framework employs a hybrid slow-fast context modeling strategy: a fast-streaming sliding-window KV cache ensures responsive action generation, while a slow-updating memory context compresses historical visual states using training-free voxel-based 3D spatial pruning.
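
The bookkeeping behind a bounded slow-fast context can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the class name `SlowFastContext`, the `memory_slots` cap, and the caller-supplied `compress` function are all assumptions introduced here, and real KV-cache reuse inside a Video-LLM is not modeled.

```python
from collections import deque


class SlowFastContext:
    """Toy bookkeeping for a bounded streaming context (hypothetical sketch)."""

    def __init__(self, window_size: int, memory_slots: int):
        # Fast context: the most recent dialogue turns, kept verbatim.
        self.window = deque()
        self.window_size = window_size
        # Slow context: compressed summaries of turns that left the window.
        # The fixed number of slots is an assumption made to keep size bounded.
        self.memory = deque(maxlen=memory_slots)

    def add_turn(self, tokens, compress):
        """Append a turn; move the oldest turn into memory once the window is full."""
        self.window.append(list(tokens))
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.memory.append(list(compress(evicted)))

    def context_length(self) -> int:
        """Total number of tokens the model would attend to at this step."""
        fast = sum(len(turn) for turn in self.window)
        slow = sum(len(entry) for entry in self.memory)
        return fast + slow


# Usage: five 4-token turns, with a stand-in compressor that keeps every other token.
ctx = SlowFastContext(window_size=2, memory_slots=2)
for step in range(5):
    ctx.add_turn([f"t{step}_{k}" for k in range(4)], compress=lambda t: t[::2])
```

Under these assumptions the context length stops growing once both the window and the memory are full, which is the property the slow-fast design is meant to provide.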

Key results

  • State-of-the-art performance on VLN-CE R2R and RxR benchmarks
  • Voxel-based spatial pruning reduces visual tokens by ~30% with minimal accuracy loss
  • Real-time deployment on a Unitree Go2 robot dog with ~0.27 s inference latency
  • Enhanced generalization to novel instructions via co-training with multimodal data
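
The voxel-based pruning result can be illustrated with a minimal sketch. It assumes that each visual token already has a 3D position (for example, back-projected from depth) and that when several tokens fall into the same voxel only the one from the newest frame is kept. The function name `voxel_prune`, the `voxel_size` default, and this keep-newest rule are assumptions for illustration, not details confirmed by the summary above.

```python
import math


def voxel_prune(points, frame_ids, voxel_size=0.2):
    """Return sorted indices of tokens to keep: the newest token in each voxel.

    points    -- list of (x, y, z) positions, one per visual token (assumed given)
    frame_ids -- list of frame indices, one per token; larger means more recent
    """
    newest = {}  # voxel grid index -> index of the token currently kept
    for i, ((x, y, z), frame) in enumerate(zip(points, frame_ids)):
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        kept = newest.get(key)
        # Replace the kept token if this one comes from a same-or-newer frame.
        if kept is None or frame >= frame_ids[kept]:
            newest[key] = i
    return sorted(newest.values())


# Usage: three tokens share one voxel across frames 0, 1, 2; one token is elsewhere.
points = [(0.05, 0.05, 0.05), (0.10, 0.10, 0.10), (0.50, 0.0, 0.0), (0.12, 0.02, 0.18)]
frame_ids = [0, 1, 1, 2]
keep = voxel_prune(points, frame_ids)
```

Because the rule only needs token positions and frame order, a sketch like this requires no learned parameters, which is consistent with the pruning being described as training-free.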

Why it matters

It provides a scalable, low-latency framework for real-world embodied AI, bridging the gap between high-performance offline models and continuous, resource-constrained robotic deployment.

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of multi-turn dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves real-time dialogues through KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks show state-of-the-art performance with low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.

Index terms

Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning