A VLM-Drone System for Indoor Navigation Assistance with Semantic Reasoning for the Visually Impaired
Zezhong Zhang, Chenyu Hu, Sunwoh Lye, Chen Lv
Abstract
Reduced vision significantly impacts the daily lives of people with visual impairments (PVI), often posing challenges in navigation and spatial awareness. To enhance the semantic reasoning capabilities of assistive technologies, we have developed a guidance system that integrates large vision-language models (VLMs) with a collision-avoidance drone. This system provides navigational assistance in indoor environments by interpreting semantic wayfinding signs. At the software level, we propose a hierarchical cross-prompt VLM (HCP-VLM) structure that leverages both Claude 3.5 Sonnet and ChatGPT 4o. This structure improves the reasoning accuracy of semantic wayfinding signs to 76.73%, outperforming the standalone accuracies of Claude (74.73%) and ChatGPT (66.35%). A specialized wayfinding sign dataset was developed to fine-tune and evaluate the VLM. At the hardware level, an ultralight dual-modal Time of Flight (TOF) Laser-Camera module was integrated into the drone to detect obstacles, track users, and identify signs. Additionally, a vibration module was designed to communicate orientation and mobility information to users. The system’s performance was evaluated in unfamiliar office buildings with two blindfolded sighted subjects, both of whom successfully located their target rooms with assistance from the system. To further drive innovation, we have released the dataset and code for public access, aiming to inspire advancements in intelligent assistive technologies.