ICRA 2026

Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People

Renato Alexandre Ribeiro, Simon Chu, and Shigeo Morishima

AI summary

Memory-based route instructions from sighted passersby are significantly longer, more ambiguous, and harder for VLN models to interpret than traditional think-out-loud instructions.

Visual Language Navigation, Assistive Robotics, Memory-Based Instructions, Benchmarking, LLM Navigation, Blind Guidance

Problem

Existing VLN benchmarks assume that instructions are given while thinking out loud or with prebuilt maps available, so they fail to address the real-world scenario in which blind people ask sighted passersby for directions in unfamiliar, maze-like public spaces. As a result, realistic benchmarks and models for memory-based, map-less navigation guidance are lacking.

Approach

The authors created the Memory-Maze benchmark, featuring a CARLA-simulated maze-like public environment and two instruction datasets (online think-out-loud vs. onsite memory-based), alongside a single-inference LLM-powered baseline model that converts instructions directly into navigation code without iterative steps or prior maps.
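The single-inference idea described above can be illustrated with a toy sketch: one model call turns a route instruction into a short program, which is then executed against a small navigation interface. Everything here is an assumption made for illustration and is not the authors' implementation: the `ToyRobot` class, its `forward`/`turn_left`/`turn_right` methods, the prompt wording, and `fake_llm` (a canned stand-in for a real LLM call) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyRobot:
    """Hypothetical grid robot; not the paper's navigation API."""
    x: int = 0
    y: int = 0
    heading: int = 0  # 0 = north, 1 = east, 2 = south, 3 = west
    log: List[str] = field(default_factory=list)

    def forward(self, steps: int = 1) -> None:
        # Move `steps` cells along the current heading.
        dx, dy = [(0, 1), (1, 0), (0, -1), (-1, 0)][self.heading]
        self.x += dx * steps
        self.y += dy * steps
        self.log.append(f"forward({steps})")

    def turn_left(self) -> None:
        self.heading = (self.heading - 1) % 4
        self.log.append("turn_left()")

    def turn_right(self) -> None:
        self.heading = (self.heading + 1) % 4
        self.log.append("turn_right()")


def build_prompt(instruction: str) -> str:
    """Assumed prompt format: describe the interface, then the instruction."""
    return (
        "You control `robot` with forward(steps), turn_left(), turn_right().\n"
        "Write Python calls that follow this route instruction:\n"
        f"{instruction}\n"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same canned program."""
    return "robot.forward(2)\nrobot.turn_right()\nrobot.forward(3)\n"


def run_single_inference(
    instruction: str, llm: Callable[[str], str], robot: ToyRobot
) -> ToyRobot:
    # Exactly one model call: no iterative re-prompting and no prior map.
    program = llm(build_prompt(instruction))
    # Toy execution only. A real system should validate or sandbox
    # generated code rather than pass it to exec().
    exec(program, {"__builtins__": {}}, {"robot": robot})
    return robot


if __name__ == "__main__":
    result = run_single_inference(
        "Go straight, then turn right at the corner.", fake_llm, ToyRobot()
    )
    print(result.x, result.y, result.heading, result.log)
```

In this sketch the whole route is committed to in a single generated program, so any ambiguity or error in the spoken instruction propagates directly into the executed plan, which is the kind of difficulty the benchmark is meant to expose.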

Key results

Construction of the Memory-Maze benchmark with CARLA-simulated public spaces
Collection of dual instruction datasets highlighting memory-based vs. think-out-loud differences
Development of a single-inference LLM baseline model for map-less navigation
Demonstration that state-of-the-art VLN models struggle with memory-based instruction ambiguity

Why it matters

Provides a critical new benchmark and baseline for developing robust, map-less VLN systems that can assist blind individuals in navigating unfamiliar public spaces using natural, memory-based directions.

Abstract

Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contains stutters, errors, and omissions of details, as opposed to those obtained by thinking out loud, such as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory was longer and contained more varied wording. We further demonstrate that addressing errors and ambiguities from memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and controls.

Index terms

Vision-Based Navigation, Performance Evaluation and Benchmarking