← Back ICRA 2026

SA-VLM V2: Useful, Comprehensive, and Concise Guidance for Guide-Dog Robots Assisting the Visually Impaired

Woo-han Yun, JaeHo Shin, BeomSu Seo, Jaehong Kim, ByungOk Han

PDF

AI summary

Key figure (auto-extracted from paper)

SA-VLMv2, a template-fine-tuned vision-language model, generates significantly more useful, comprehensive, and concise walking guidance for visually impaired users than state-of-the-art proprietary and open-source models.

Guide-dog robots Visually impaired assistance Vision-language models Walking guidance Human-robot interaction Template-based generation

Problem

Existing vision-language models fail to deliver walking guidance that is simultaneously useful, comprehensive, and concise for visually impaired users, often producing unstructured or overly detailed explanations that hinder navigation.

Approach

Researchers surveyed professional guide-dog trainers to derive four canonical guidance templates, curated a dataset of 19,945 manually annotated samples aligned with these formats, and fine-tuned the Qwen2.5VL model to generate structured walking instructions.

Key results

Derived four expert-informed canonical templates for walking guidance
Curated a dataset of 19,945 manually annotated, template-aligned guidance samples
Fine-tuned Qwen2.5VL into SA-VLMv2, outperforming proprietary MLLMs and base VLMs
Achieved higher scores across usefulness, comprehensibility, and conciseness metrics

Why it matters

Advances assistive robotics by providing visually impaired individuals with reliable, robot-delivered navigation support that enhances outdoor mobility and safety.

Abstract

The development of guide dog robots is expected to enhance the mobility and safety of visually impaired individuals outdoors. To assist these users in real-world navigation, walking guidance should be useful, comprehensive, and concise so that instructions are both actionable and easy to follow. While recent VLMs show promising capabilities in scene understanding, existing approaches do not address the effective delivery of guidance for visually impaired users. In this work, we propose SA-VLMv2 (Space-Aware VLM), a model designed to generate useful, comprehensive, and concise walking guidance based on ego-centric scenes and target destinations. To this end, we first derived four canonical templates for walking guidance through user evaluation with professional guide dog trainers across diverse images, providing insights into preferred guidance formats. We then collected, manually annotated, curated a dataset of 19,945 samples aligned with these templates and trained SA-VLMv2 from the open-sourced VLM, Qwen2.5VL. Experimental results show that SA-VLMv2 outperforms state- of-the-art proprietary MLLMs (Claude 3.5 Sonnet, Gemini 2.5, GPT-4o) and the open-sourced pretrained VLM (Qwen2.5VL) in both holistic and factor-wise evaluations. SA-VLMv2 gen- erated more concise yet informative guidance while achieving higher scores across multiple evaluation factors.

Index terms

Multi-Modal Perception for HRI Robot Companions Social HRI