ICRA 2026

Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments

Amirreza Payandeh, Anuj Pokhrel, Daeun Song, Marcos Zampieri, Xuesu Xiao

AI summary

Distilling chain-of-thought language reasoning into a lightweight visual encoder enables real-time, socially aware robot navigation that significantly outperforms existing baselines.

Visual Navigation, Social Robotics, Self-Supervised Learning, Vision-Language Models, Real-Time Control, Barlow Twins

Problem

Current VLM-based navigation methods are too computationally heavy for real-time use, while traditional learning-based approaches cannot interpret social cues or reason about scene context from visual data alone.

Approach

The model is trained in two self-supervised stages. First, a large teacher network learns from RGB images, motion commands, and chain-of-thought text descriptions. This multi-modal reasoning is then distilled into a lightweight student network via a Barlow Twins loss, so that inference runs in real time on RGB input without any text.
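To make the distillation objective concrete, here is a minimal NumPy sketch of a Barlow Twins redundancy-reduction loss between a batch of teacher embeddings and a batch of student embeddings. It is an illustration under assumptions, not the authors' implementation. The summary does not say which embeddings are paired, how they are projected, or how the terms are weighted. The function name `barlow_twins_loss`, its argument names, and the off-diagonal weight `5e-3` (the default from the original Barlow Twins paper) are choices made here for illustration.

```python
import numpy as np


def barlow_twins_loss(z_teacher, z_student, lambda_offdiag=5e-3, eps=1e-8):
    """Barlow Twins loss between two embedding batches of shape (batch, dim).

    Hypothetical helper: the names and the default weight are assumptions,
    not values taken from the Narrate2Nav paper.
    """
    z_teacher = np.asarray(z_teacher, dtype=np.float64)
    z_student = np.asarray(z_student, dtype=np.float64)
    if z_teacher.ndim != 2 or z_teacher.shape != z_student.shape:
        raise ValueError("expected two arrays of identical shape (batch, dim)")
    batch_size = z_teacher.shape[0]

    # Standardize every feature across the batch (zero mean, unit variance).
    a = (z_teacher - z_teacher.mean(axis=0)) / (z_teacher.std(axis=0) + eps)
    b = (z_student - z_student.mean(axis=0)) / (z_student.std(axis=0) + eps)

    # Cross-correlation matrix between the two sets of features: (dim, dim).
    c = a.T @ b / batch_size

    # Invariance term: pull matching features toward correlation 1.
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)

    # Redundancy-reduction term: push all other correlations toward 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)

    return float(on_diag + lambda_offdiag * off_diag)
```

In a setup like the one described, `z_teacher` would come from the multi-modal teacher and `z_student` from the RGB-only student for the same batch of samples. Minimizing the loss pushes each student feature to track its teacher counterpart while staying decorrelated from the other features.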

Key results

52.94% offline performance gain over next best baseline
41.67% real-world navigation improvement
Real-time inference using only RGB history and goal coordinates
Lightweight 29M-parameter student model distills multi-modal reasoning

Why it matters

Provides a practical, computationally efficient pathway for deploying socially intelligent navigation in real-world mobile robotics without heavy VLM inference costs.

Abstract

Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose NARRATE2NAV, a real-time vision-action model that leverages a self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit reasoning-informed language supervision, social cues, and human intentions within a visual encoder. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to low-level motion commands for short-horizon point-goal navigation during deployment. Extensive evaluation of NARRATE2NAV across diverse and challenging scenarios in an unseen offline dataset, complemented by a small-scale real-world experiment, demonstrates a 52.94% improvement over the next best baseline in offline testing, with consistent gains observed in real-world evaluations.

Index terms

Imitation Learning, Visual Learning, Representation Learning