ICRA 2026

Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach

Aaron Hao Tan, Angus Fung, Haitong Wang, Goldie Nejat

AI summary

A novel architecture leverages vision-language models to enable robots to navigate accurately using imperfect, hand-drawn maps across diverse environments and robot types.
Mobile robot navigation, vision language models, hand-drawn maps, robot planning, zero-shot navigation, topological mapping

Problem

Existing map-based methods require costly, accurate maps or expert knowledge, while map-less methods demand time-consuming exploration. Previous hand-drawn map approaches fail with complex environments, multi-floor layouts, and human sketch inaccuracies like scale distortions and missing landmarks.

Approach

HAM-Nav uses pre-trained vision-language models to interpret hand-drawn sketches alongside real-time camera feeds. It aligns robot views with a dynamic topological map via selective visual prompting and infers missing landmarks using the model's common-sense knowledge.
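
To make the idea of a topological map concrete, the following is a minimal sketch, not HAM-Nav's implementation. It assumes a sketch has already been turned into a graph whose nodes are landmarks and whose edges mean "directly reachable". The landmark names, the `sketch_map` dictionary, and the `shortest_route` helper are all hypothetical, and plain breadth-first search stands in for whatever planning the VLM performs.

```python
from collections import deque
from typing import Dict, List, Optional

def shortest_route(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return the route with the fewest hops from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            route: List[str] = []
            current: Optional[str] = node
            while current is not None:
                route.append(current)
                current = parents[current]
            return route[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical landmarks, as they might be read off a hand-drawn map.
sketch_map = {
    "entrance": ["hallway"],
    "hallway": ["entrance", "kitchen", "stairs"],
    "kitchen": ["hallway"],
    "stairs": ["hallway", "office"],
    "office": ["stairs"],
}

route = shortest_route(sketch_map, "entrance", "office")
```

A graph like this stores only connectivity, not distances, which is one reason topological representations can tolerate the scale distortions typical of hand-drawn sketches.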

Key results

  • Zero-shot generalization across diverse environments, drawing styles, and robot embodiments
  • Selective Visual Association Prompting (SVAP) for topological position estimation
  • Predictive Navigation Plan Parser (PNPP) for inferring omitted landmarks
  • High navigation success rates and path efficiency in simulations and real-world user studies

Why it matters

Enables cost-effective, intuitive robot navigation in dynamic or unknown environments by allowing humans to guide robots with simple sketches, bypassing complex mapping or exploration phases.

Abstract

Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This letter introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes compared against a non-hand-drawn map approach.
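
Success weighted by Path Length (SPL) is commonly defined over N episodes as SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is 1 if episode i succeeded and 0 otherwise, l_i is the shortest-path length to the goal, and p_i is the length of the path the robot actually took. The sketch below computes that standard definition. The function name `spl` and the sample numbers are illustrative assumptions, and nothing here is taken from the paper's evaluation code.

```python
from typing import Sequence

def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            # Assumption of this sketch: start and goal are distinct.
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A failed episode contributes 0; a success contributes at most 1.
            total += shortest / max(taken, shortest)
    return total / n

# Illustrative episodes: one optimal success, one success with a path twice
# as long as necessary, and one failure.
score = spl([True, True, False], [10.0, 10.0, 5.0], [10.0, 20.0, 5.0])
```

With these made-up episodes the score is (1 + 0.5 + 0) / 3 = 0.5, which shows how SPL penalizes both failures and detours, whereas a plain success rate would report 2/3.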

Index terms

AI-Enabled Robotics, Task and Motion Planning, Vision-Based Navigation
