ICRA 2026

LISN: Language-Instructed Social Navigation with VLM-Based Controller Modulating

Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao

AI summary

A fast-slow hierarchical framework that uses a vision-language model to dynamically tune a classical social planner achieves a 91.3% success rate, significantly outperforming baselines in complex, instruction-following navigation tasks.

Language-instructed navigation, Social navigation, Vision-language models, Fast-slow hierarchical control, Robot benchmarking, Social force model

Problem

Existing social navigation benchmarks and methods overlook high-level language-instructed social rules and scene understanding, while current VLM-based approaches struggle with real-time control due to slow inference speeds.

Approach

The authors propose a hierarchical system that decouples slow, high-level VLM reasoning from fast, low-level reactive control, using the VLM to dynamically adjust costmaps and planner parameters based on language instructions and visual context.
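The fast-slow decoupling described above can be illustrated with a minimal sketch. This is not the authors' implementation: `PlannerParams`, `slow_vlm_update`, `fast_control_step`, and `run` are hypothetical names, the "VLM" is replaced by a keyword rule so the example runs without a model, and the reactive controller is a simplified attraction-plus-repulsion rule in the spirit of a social force model. Costmap modulation is omitted; only planner parameters are adjusted.

```python
import math
from dataclasses import dataclass


@dataclass
class PlannerParams:
    """Hypothetical parameters that the slow loop is allowed to modify."""
    max_speed: float = 1.0        # m/s
    repulsion_gain: float = 2.0   # strength of pedestrian repulsion
    repulsion_range: float = 1.0  # metres, decay length of the repulsion


def slow_vlm_update(instruction: str, params: PlannerParams) -> PlannerParams:
    """Stand-in for the slow loop: a keyword rule, NOT a real VLM call."""
    text = instruction.lower()
    if "careful" in text or "crowd" in text:
        params.max_speed = 0.5
        params.repulsion_gain = 4.0
    return params


def fast_control_step(pos, goal, pedestrians, params):
    """One reactive step: goal attraction plus exponential pedestrian repulsion."""
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy) or 1e-9
    vx, vy = gx / dist * params.max_speed, gy / dist * params.max_speed
    for px, py in pedestrians:
        dx, dy = pos[0] - px, pos[1] - py
        d = math.hypot(dx, dy) or 1e-9
        push = params.repulsion_gain * math.exp(-d / params.repulsion_range)
        vx += push * dx / d
        vy += push * dy / d
    # Clamp the commanded velocity to the current speed limit.
    speed = math.hypot(vx, vy)
    if speed > params.max_speed:
        vx, vy = vx / speed * params.max_speed, vy / speed * params.max_speed
    return vx, vy


def run(instruction, steps=100, slow_every=20, dt=0.1):
    """Run the fast loop every step and the slow loop every `slow_every` steps."""
    params = PlannerParams()
    pos, goal = (0.0, 0.0), (5.0, 0.0)
    pedestrians = [(2.5, 0.3)]  # one static pedestrian, for illustration only
    for t in range(steps):
        if t % slow_every == 0:
            params = slow_vlm_update(instruction, params)
        vx, vy = fast_control_step(pos, goal, pedestrians, params)
        pos = (pos[0] + vx * dt, pos[1] + vy * dt)
    return pos, params
```

The point of the structure is that `fast_control_step` never waits on the slow loop: it always acts on the most recent parameters, so the control rate does not depend on how long the high-level reasoning takes.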

Key results

Introduction of LISN-Bench, the first simulation benchmark for language-instructed social navigation
Development of Social-Nav-Modulator, a fast-slow hierarchical framework decoupling VLM reasoning from reactive control
Achievement of a 91.3% average success rate, surpassing the most competitive baseline by 63%
Demonstrated superior performance in complex tasks like following pedestrians in crowds and avoiding forbidden regions

Why it matters

Enables mobile robots to safely and intelligently navigate dynamic human environments by following complex social rules and language instructions, bridging the gap between semantic understanding and real-time robotic control.

Abstract

Towards human-robot coexistence, socially aware navigation is essential for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions with task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast–slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which surpasses the most competitive baseline by 63%, with most of the improvement observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/

Index terms

Vision-Based Navigation, Semantic Scene Understanding, AI-Enabled Robotics