Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments
Amirreza Payandeh, Anuj Pokhrel, Daeun Song, Marcos Zampieri, Xuesu Xiao
AI summary
Problem
Current VLM-based navigation methods are too computationally heavy for real-time use, while traditional learning-based approaches cannot interpret social cues or reason about context from visual data alone.
Approach
Training is a two-stage self-supervised process. A large teacher network first learns from RGB, motion commands, and chain-of-thought text descriptions. Its multi-modal reasoning is then distilled into a lightweight student network via the Barlow Twins loss, so that inference runs in real time on RGB input only.
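As a rough illustration of the Barlow Twins redundancy reduction loss named above, the sketch below computes it for two batches of embeddings in NumPy. It is not the authors' implementation: the function name, the `z_teacher`/`z_student` argument names, and the off-diagonal weight `lambda_offdiag` are assumptions chosen for this example, and a real training loop would use a differentiable framework rather than plain NumPy.

```python
import numpy as np


def barlow_twins_loss(z_teacher, z_student, lambda_offdiag=5e-3, eps=1e-9):
    """Barlow Twins loss between two (batch, dim) embedding matrices.

    Argument names and the default weight are illustrative assumptions,
    not values taken from the paper.
    """
    n, _ = z_teacher.shape

    # Standardize every feature dimension across the batch.
    z_a = (z_teacher - z_teacher.mean(axis=0)) / (z_teacher.std(axis=0) + eps)
    z_b = (z_student - z_student.mean(axis=0)) / (z_student.std(axis=0) + eps)

    # Cross-correlation matrix between the two sets of features, shape (dim, dim).
    c = z_a.T @ z_b / n

    diag = np.diag(c)
    # Invariance term: matching features should be perfectly correlated.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy reduction term: different features should be decorrelated.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)

    return on_diag + lambda_offdiag * off_diag


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(256, 8))
    unrelated = rng.normal(size=(256, 8))
    print("aligned embeddings:  ", barlow_twins_loss(z, z))
    print("unrelated embeddings:", barlow_twins_loss(z, unrelated))
```

The loss is small when the two embeddings agree feature by feature and large when they are unrelated, which is the property that lets a student encoder be pulled toward a teacher's representation.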
Key results
- 52.94% offline performance gain over next best baseline
- 41.67% real-world navigation improvement
- Real-time inference using only RGB history and goal coordinates
- Multi-modal reasoning distilled into a lightweight 29M-parameter student model
Why it matters
Provides a practical, computationally efficient pathway for deploying socially intelligent navigation in real-world mobile robotics without heavy VLM inference costs.
Abstract
Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose NARRATE2NAV, a real-time vision-action model that leverages a self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit reasoning-informed language supervision, social cues, and human intentions within a visual encoder. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to low-level motion commands for short-horizon point-goal navigation during deployment. Extensive evaluation of NARRATE2NAV across diverse and challenging scenarios in an unseen offline dataset, complemented by a small-scale real-world experiment, demonstrates a 52.94% improvement over the next best baseline in offline testing, with consistent gains observed in real-world evaluations.