SA-VLM V2: Useful, Comprehensive, and Concise Guidance for Guide-Dog Robots Assisting the Visually Impaired
Woo-han Yun, JaeHo Shin, BeomSu Seo, Jaehong Kim, ByungOk Han
AI summary
Problem
Existing vision-language models fail to deliver walking guidance that is simultaneously useful, comprehensive, and concise for visually impaired users, often producing unstructured or overly detailed explanations that hinder navigation.
Approach
The authors ran a user evaluation with professional guide dog trainers across diverse images to derive four canonical guidance templates, collected and manually annotated a dataset of 19,945 samples aligned with these templates, and fine-tuned the open-source Qwen2.5VL model to generate structured walking guidance.
Key results
- Derived four expert-informed canonical templates for walking guidance
- Curated a dataset of 19,945 manually annotated, template-aligned guidance samples
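- Fine-tuned Qwen2.5VL into SA-VLMv2, which outperforms proprietary MLLMs (Claude 3.5 Sonnet, Gemini 2.5, GPT-4o) and the pretrained Qwen2.5VL in both holistic and factor-wise evaluations
- Generated more concise yet informative guidance while scoring higher across multiple evaluation factors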
Why it matters
Advances assistive robotics by giving guide dog robots a way to deliver walking guidance that is actionable and easy to follow, supporting the outdoor mobility and safety of visually impaired individuals.
Abstract
The development of guide dog robots is expected to enhance the mobility and safety of visually impaired individuals outdoors. To assist these users in real-world navigation, walking guidance should be useful, comprehensive, and concise so that instructions are both actionable and easy to follow. While recent VLMs show promising capabilities in scene understanding, existing approaches do not address the effective delivery of guidance for visually impaired users. In this work, we propose SA-VLMv2 (Space-Aware VLM), a model designed to generate useful, comprehensive, and concise walking guidance based on ego-centric scenes and target destinations. To this end, we first derived four canonical templates for walking guidance through user evaluation with professional guide dog trainers across diverse images, providing insights into preferred guidance formats. We then collected, manually annotated, and curated a dataset of 19,945 samples aligned with these templates and trained SA-VLMv2 from the open-source VLM, Qwen2.5VL. Experimental results show that SA-VLMv2 outperforms state-of-the-art proprietary MLLMs (Claude 3.5 Sonnet, Gemini 2.5, GPT-4o) and the open-source pretrained VLM (Qwen2.5VL) in both holistic and factor-wise evaluations. SA-VLMv2 generated more concise yet informative guidance while achieving higher scores across multiple evaluation factors.
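As a rough illustration of what a template-aligned sample might look like in code, the sketch below defines a minimal record type. The abstract does not describe the dataset schema or the content of the four templates, so the class name `GuidanceSample`, its fields, the 0-based `template_id`, and the word-count proxy for conciseness are all assumptions made for illustration, not the authors' format.

```python
from dataclasses import dataclass

# The abstract reports four canonical templates; their content is not given,
# so they are represented here only by a hypothetical integer id.
NUM_TEMPLATES = 4


@dataclass(frozen=True)
class GuidanceSample:
    """Hypothetical record for one template-aligned training sample."""
    image_path: str    # ego-centric scene image
    destination: str   # target destination given to the model
    template_id: int   # assumed 0-based index of the canonical template
    guidance: str      # annotated walking guidance text

    def __post_init__(self):
        # Reject samples that do not map onto one of the four templates.
        if not 0 <= self.template_id < NUM_TEMPLATES:
            raise ValueError(f"template_id must be in [0, {NUM_TEMPLATES})")
        if not self.guidance.strip():
            raise ValueError("guidance must not be empty")

    def word_count(self) -> int:
        """Crude conciseness proxy: number of whitespace-separated words."""
        return len(self.guidance.split())


# Placeholder values only; none of these come from the actual dataset.
sample = GuidanceSample(
    image_path="scene_0001.jpg",
    destination="crosswalk",
    template_id=0,
    guidance="Placeholder guidance text for illustration.",
)
print(sample.word_count())
```

Validating the template id at construction time is one simple way to keep every sample aligned with a fixed set of formats; the evaluation factors reported in the paper are not reproduced by the word-count helper.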