ICRA 2026

Time-Aware Assistive Navigation

Zhongkai Shangguan, Masaki Kuribayashi, Eshed Ohn-Bar

AI summary

Off-the-shelf multimodal language models struggle with timely assistive navigation, but explicitly supervising them to predict the reasoning behind each instruction significantly improves their guidance accuracy and timing.

Assistive Navigation, Multimodal LLMs, Temporal Reasoning, TIMELI Benchmark, Vision-Language Navigation, AI Safety

Problem

Current multimodal language models lack the temporal planning and situational awareness needed to provide safe, timely navigation guidance for visually impaired users, often delivering distracting or poorly timed instructions in dynamic urban settings.

Approach

The authors introduce TIMELI, a large-scale benchmark for evaluating time-aware assistive navigation, and enhance MLLMs by adding direct supervision to predict the underlying reason for each instruction before generating it.
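As a rough illustration of reason-first supervision, the sketch below serializes each timestep so that the reason is decoded before the instruction, and treats silence as a valid output. The `Step` class, the `<silent>` marker, and the `Reason:` / `Instruction:` layout are assumptions made for this example, not the format used in the paper or in TIMELI.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker meaning "say nothing at this timestep" (an assumption).
SILENT = "<silent>"


@dataclass
class Step:
    reason: str                 # why the agent should (or should not) speak now
    instruction: Optional[str]  # None when staying silent is the right choice


def build_target(step: Step) -> str:
    """Serialize one timestep so the reason precedes the instruction."""
    spoken = step.instruction if step.instruction is not None else SILENT
    return f"Reason: {step.reason}\nInstruction: {spoken}"


def parse_output(text: str) -> Optional[str]:
    """Return the instruction to speak, or None if the output means silence."""
    for line in text.splitlines():
        if line.startswith("Instruction:"):
            spoken = line[len("Instruction:"):].strip()
            return None if spoken == SILENT or not spoken else spoken
    return None
```

A target built this way could serve as the supervised output for one timestep, with `parse_output` deciding at inference time whether anything is spoken to the user.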

Key results

Introduction of TIMELI, a large-scale multimodal benchmark for time-aware assistive navigation
Demonstration that off-the-shelf MLLMs fail at safe, timely instruction generation even after fine-tuning
Significant performance gains across open-loop, closed-loop, and sim-to-real settings with explicit reason-prediction supervision
Planned release of the simulation environment, benchmark dataset, and code for scalable assistive agent development

Why it matters

It advances the development of safe, real-time assistive AI for visually impaired individuals and highlights critical gaps in temporal reasoning for multimodal models.

Abstract

Can interactive vision-and-language agents learn not just what to say but also when to say it? Current language models rarely plan over whether and when to realize a real-time response to a user. However, providing accurate and timely support for human decision-making, such as when guiding visually impaired individuals through urban environments, requires careful real-time responsiveness: poorly timed responses can distract users or add unnecessary cognitive load. As a machine intelligence challenge for Multimodal Large Language Model (MLLM)-based agents, we introduce a large-scale multimodal benchmark for an egocentric, assistive navigation task in complex outdoor environments. Using this benchmark, we uncover a fundamental limitation of off-the-shelf MLLMs in delivering safe and time-sensitive navigation instructions, even with model fine-tuning on substantial amounts of data. We then demonstrate that a simple yet effective modification of the model, including direct supervision to predict the underlying reason for each instruction, yields significant performance gains across open-loop, closed-loop, and sim-to-real generalization settings. However, our analysis highlights persistent challenges in temporal reasoning, safety-critical object awareness, and relational and distance understanding. To advance the development of scalable assistive agents, we will release our simulation, benchmark, and code (available at project website: https://timeli-icra.github.io/).

Index terms

Intelligent Transportation Systems; Robotics and Automation in Agriculture and Forestry; Robotics and Automation in Construction