LISN: Language-Instructed Social Navigation with VLM-Based Controller Modulating
Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao
AI summary
Problem
Existing social navigation benchmarks and methods overlook high-level language-instructed social rules and scene understanding, while current VLM-based approaches struggle with real-time control due to slow inference speeds.
Approach
The authors propose a hierarchical system that decouples slow, high-level VLM reasoning from fast, low-level reactive control, using the VLM to dynamically adjust costmaps and planner parameters based on language instructions and visual context.
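The decoupling described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PlannerParams`, `slow_vlm_step`, `fast_control_step`, and `run` are hypothetical, the keyword rule stands in for a real VLM call, and the rates (a 10 Hz control loop with a parameter refresh every 10 ticks) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlannerParams:
    # Hypothetical planner parameters; not taken from the paper.
    max_speed: float = 1.0      # metres per second
    social_radius: float = 0.5  # metres kept around pedestrians


def slow_vlm_step(instruction: str) -> PlannerParams:
    """Stand-in for the slow VLM call: a keyword rule replaces the real model."""
    if "slow" in instruction.lower():
        return PlannerParams(max_speed=0.4, social_radius=1.0)
    return PlannerParams()


def fast_control_step(params: PlannerParams, dist_to_goal: float, dt: float) -> float:
    """Fast reactive step: advance toward the goal, capped by the current max speed."""
    step = min(params.max_speed * dt, dist_to_goal)
    return dist_to_goal - step


def run(instruction: str, dist_to_goal: float = 2.0, control_hz: int = 10,
        vlm_every: int = 10, max_ticks: int = 200) -> tuple[int, int]:
    """Run the control loop; return (control ticks, number of slow VLM calls)."""
    params = PlannerParams()
    dt = 1.0 / control_hz
    ticks = 0
    vlm_calls = 0
    while dist_to_goal > 1e-9 and ticks < max_ticks:
        # The slow loop only refreshes parameters every `vlm_every` ticks;
        # the fast loop keeps acting on the most recent parameters in between.
        if ticks % vlm_every == 0:
            params = slow_vlm_step(instruction)
            vlm_calls += 1
        dist_to_goal = fast_control_step(params, dist_to_goal, dt)
        ticks += 1
    return ticks, vlm_calls
```

In this toy setting the controller acts on every tick while the stand-in VLM is consulted only occasionally, which is the property the hierarchical design relies on. A real system would run the two loops asynchronously rather than in a single loop.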
Key results
- Introduction of LISN-Bench, the first simulation benchmark for language-instructed social navigation
- Development of Social-Nav-Modulator, a fast-slow hierarchical framework decoupling VLM reasoning from reactive control
- Achievement of a 91.3% average success rate, surpassing the most competitive baseline by 63%
- Demonstrated superior performance in complex tasks like following pedestrians in crowds and avoiding forbidden regions
Why it matters
Enables mobile robots to safely and intelligently navigate dynamic human environments by following complex social rules and language instructions, bridging the gap between semantic understanding and real-time robotic control.
Abstract
Socially aware navigation is important for mobile robots that must coexist with humans. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions with the task goals and social norms that humans express. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast–slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, surpassing the most competitive baseline by 63%, with most of the improvement observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
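As a rough illustration of what modulating a costmap for an instruction-forbidden region could look like, consider the sketch below. It is an assumption-laden toy, not the method from the paper: the grid, the additive `penalty`, and the helper names `modulate_costmap`, `path_cost`, and `pick_path` are all hypothetical, and the forbidden cells are given directly rather than produced by a VLM.

```python
def modulate_costmap(costmap, forbidden_cells, penalty=100.0):
    """Return a copy of the costmap with a penalty added on each forbidden cell."""
    out = [row[:] for row in costmap]
    for r, c in forbidden_cells:
        out[r][c] += penalty
    return out


def path_cost(costmap, path):
    """Sum the cell costs along a path given as (row, col) pairs."""
    return sum(costmap[r][c] for r, c in path)


def pick_path(costmap, candidates):
    """Choose the cheapest candidate path under the given costmap."""
    return min(candidates, key=lambda p: path_cost(costmap, p))


# A 3x3 grid with uniform cost (hypothetical values).
base = [[1.0] * 3 for _ in range(3)]
direct = [(1, 0), (1, 1), (1, 2)]                  # crosses the centre cell
detour = [(1, 0), (0, 0), (0, 1), (0, 2), (1, 2)]  # goes around it

# Suppose the instruction marks the centre cell as forbidden.
modulated = modulate_costmap(base, forbidden_cells=[(1, 1)])

chosen_before = pick_path(base, [direct, detour])
chosen_after = pick_path(modulated, [direct, detour])
```

Here the planner itself is unchanged; only the costs it sees are edited, so the shorter direct path is preferred on the base map and the detour is preferred once the centre cell is penalized.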