ICRA 2026

Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach

Aaron Hao Tan, Angus Fung, Haitong Wang, Goldie Nejat

AI summary

A novel architecture leverages vision-language models to enable robots to navigate accurately using imperfect, hand-drawn maps across diverse environments and robot types.

Mobile robot navigation; vision language models; hand-drawn maps; robot planning; zero-shot navigation; topological mapping

Problem

Existing map-based methods require costly, accurate maps or expert knowledge, while map-less methods demand time-consuming exploration. Previous hand-drawn map approaches break down in complex environments and multi-floor layouts, and they do not handle human sketch inaccuracies such as scale distortions and missing landmarks.

Approach

HAM-Nav (Hand-drawn Map Navigation) uses pre-trained vision-language models to interpret hand-drawn sketches alongside real-time camera feeds. It aligns robot views with a dynamic topological map via selective visual prompting and infers missing landmarks using the model's common-sense knowledge.

Key results

Zero-shot generalization across diverse environments, drawing styles, and robot embodiments
Selective Visual Association Prompting (SVAP) for topological position estimation
Predictive Navigation Plan Parser (PNPP) for inferring omitted landmarks
High navigation success rates and path efficiency in simulations and real-world user studies

Why it matters

Enables cost-effective, intuitive robot navigation in dynamic or unknown environments by allowing humans to guide robots with simple sketches, bypassing complex mapping or exploration phases.

Abstract

Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This letter introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes compared against a non-hand-drawn map approach.
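
For readers unfamiliar with the Success weighted by Path Length (SPL) metric named in the abstract, the sketch below computes it under the commonly used definition: the mean, over all episodes, of the success indicator multiplied by the ratio of the shortest-path length to the larger of the shortest-path length and the path actually travelled. That the paper follows exactly this definition is an assumption. The `Episode` type, the `spl` function, and the numbers in the usage example are hypothetical illustrations, not code or results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation trial (hypothetical container, not from the paper)."""
    success: bool          # True if the robot reached the goal
    shortest_path: float   # shortest start-to-goal path length, in metres
    actual_path: float     # length of the path the robot actually travelled


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.shortest_path <= 0:
            raise ValueError("shortest_path must be positive")
        if ep.success:
            # A failed episode contributes 0; a successful one contributes
            # at most 1, reached when the robot takes the shortest path.
            total += ep.shortest_path / max(ep.actual_path, ep.shortest_path)
    return total / len(episodes)


if __name__ == "__main__":
    # Made-up trials: one optimal success, one success with a detour,
    # and one failure.
    trials = [
        Episode(success=True, shortest_path=10.0, actual_path=10.0),
        Episode(success=True, shortest_path=10.0, actual_path=20.0),
        Episode(success=False, shortest_path=10.0, actual_path=5.0),
    ]
    print(f"SPL = {spl(trials):.2f}")
```

Because failures score zero and detours are penalised in proportion to their extra length, SPL is never higher than the plain success rate. This is why the two metrics are typically reported together.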

Index terms

AI-Enabled Robotics; Task and Motion Planning; Vision-Based Navigation