Time-Aware Assistive Navigation
Zhongkai Shangguan, Masaki Kuribayashi, Eshed Ohn-Bar
AI summary
Problem
Current multimodal language models lack the temporal planning and situational awareness needed to provide safe, timely navigation guidance for visually impaired users, often delivering distracting or poorly timed instructions in dynamic urban settings.
Approach
The authors introduce TIMELI, a large-scale benchmark for evaluating time-aware assistive navigation, and improve multimodal large language models (MLLMs) by adding direct supervision to predict the underlying reason for each instruction before generating it.
Key results
- Introduction of TIMELI, a large-scale multimodal benchmark for time-aware assistive navigation
- Demonstration that off-the-shelf MLLMs fail at safe, timely instruction generation even after fine-tuning
- Significant performance gains across open-loop, closed-loop, and sim-to-real settings with explicit reason-prediction supervision
- Release of simulation environment, benchmark dataset, and code for scalable assistive agent development
Why it matters
The work advances the development of safe, real-time assistive AI for visually impaired individuals and highlights critical gaps in temporal reasoning for multimodal models.
Abstract
Can interactive vision-and-language agents learn not just what to say but also when to say it? Current language models rarely plan over whether and when to realize a real-time response to a user. However, providing accurate and timely support for human decision-making, such as when guiding visually impaired individuals through urban environments, requires careful real-time responsiveness: poorly timed responses can distract users or add unnecessary cognitive load. As a machine intelligence challenge for Multimodal Large Language Model (MLLM)-based agents, we introduce a large-scale multimodal benchmark for an egocentric, assistive navigation task in complex outdoor environments. Using this benchmark, we uncover a fundamental limitation of off-the-shelf MLLMs in delivering safe and time-sensitive navigation instructions, even with model fine-tuning on substantial amounts of data. We then demonstrate that a simple yet effective modification of the model, including direct supervision to predict the underlying reason for each instruction, yields significant performance gains across open-loop, closed-loop, and sim-to-real generalization settings. However, our analysis highlights persistent challenges in temporal reasoning, safety-critical object awareness, and relational and distance understanding. To advance the development of scalable assistive agents, we will release our simulation, benchmark, and code (available at project website: https://timeli-icra.github.io/).