Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People
Renato Alexandre Ribeiro, Simon Chu, and Shigeo Morishima
AI summary
Problem
Existing VLN benchmarks assume instructions are given while thinking out loud or with prebuilt maps, and so fail to address the real-world scenario in which blind people ask sighted passersby for directions in unfamiliar, maze-like public spaces. As a result, realistic benchmarks and models for memory-based, map-less navigation guidance are lacking.
Approach
The authors created the Memory-Maze benchmark, featuring a CARLA-simulated maze-like public environment and two instruction datasets (online think-out-loud vs. onsite memory-based), alongside a single-inference LLM-powered baseline model that converts instructions directly into navigation code without iterative steps or prior maps.
Key results
- Construction of the Memory-Maze benchmark with CARLA-simulated public spaces
- Collection of dual instruction datasets highlighting memory-based vs. think-out-loud differences
- Development of a single-inference LLM baseline model for map-less navigation
- Demonstration that state-of-the-art VLN models struggle with memory-based instruction ambiguity
Why it matters
Provides a critical new benchmark and baseline for developing robust, map-less VLN systems that can assist blind individuals in navigating unfamiliar public spaces using natural, memory-based directions.
Abstract
Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omissions of details, as opposed to those obtained by thinking out loud, such as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory was longer and contained more varied wording. We further demonstrate that addressing errors and ambiguities from memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and controls.