ICRA 2026

GSUC-VLM: Geometrically-Guided Spatial Understanding Chain of Vision Language Model for Autonomous Driving

Yifan Zhao, Ziyang Zheng, Congjia Chen, Shizhuo Zhang, Huixin Zhang, Wenrui Dai, Fan He, Hongkai Xiong

AI summary

GSUC-VLM achieves state-of-the-art zero-shot and fine-tuned Visual Question Answering performance in autonomous driving by unifying semantic and spatial features without external 3D data.

Autonomous Driving, Vision-Language Models, Spatial Understanding, Multi-View Fusion, Visual Question Answering, Zero-Shot Generalization

Problem

Existing Vision-Language Models lack robust spatial understanding for multi-view driving scenes, often relying on misaligned external modalities like point clouds or detection priors that degrade scalability and semantic capabilities.

Approach

The method extracts semantic and spatial features from multi-view images with a dual encoder and fuses them through a lightweight connector, aligning the two feature types with a distillation loss. It also embeds camera projection matrix encodings into the input vectors and adds 3D position embeddings in the fusion layer to capture spatial relationships across views.
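
To make the distillation-based alignment more concrete, the following is a minimal sketch of one common form of feature distillation: the mean cosine distance between paired token vectors. The exact loss used in the paper is not given here, so the cosine formulation, the function names, and the assumption that the spatial encoder's tokens serve as the alignment target are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-8)


def distillation_loss(fused_tokens, spatial_tokens):
    """Mean (1 - cosine similarity) over paired tokens.

    Assumption (not from the paper): fused_tokens are the tokens being
    trained and spatial_tokens are the alignment targets produced by
    the spatial encoder.
    """
    if len(fused_tokens) != len(spatial_tokens):
        raise ValueError("token sequences must have the same length")
    losses = [
        1.0 - cosine_similarity(f, s)
        for f, s in zip(fused_tokens, spatial_tokens)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy tokens: identical features give a loss near 0,
# orthogonal features give a loss of 1.
aligned = distillation_loss([[1.0, 2.0], [0.5, 0.5]], [[1.0, 2.0], [0.5, 0.5]])
orthogonal = distillation_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In this form the loss depends only on the direction of each token vector, not its magnitude, which is one reason cosine-based objectives are often chosen when the two encoders produce features at different scales.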

Key results

Achieves state-of-the-art accuracy on NuScenes-QA and DriveLMM-o1 benchmarks.
Demonstrates strong zero-shot generalization on DriveBench without task-specific fine-tuning.
Provides interpretable Chain-of-Thought reasoning for spatial understanding.
Preserves pre-trained VLM semantic capabilities while enhancing multi-view spatial alignment.

Why it matters

Accurate, context-aware spatial reasoning directly from multi-camera inputs, without costly 3D sensor dependencies, could support safer and more reliable autonomous driving systems.

Abstract

Robust spatial understanding is crucial for Visual Question Answering (VQA) in autonomous driving that aims to enhance decision-making, reduce positional risks, and ensure road safety by providing answers based on the perception, prediction, and planning of driving scenarios. Despite remarkable success in semantic understanding of images and videos, existing Vision-Language Models (VLMs), as the prevailing paradigms for VQA, are limited in spatial understanding for multi-view scenes due to the lack of latent unified 3D reconstruction capability. They usually resort to additional spatial modalities such as point clouds or prior detection frameworks to enhance spatial understanding ability, but are still challenged by modality misalignment and degraded scalability. To overcome these limitations, in this paper, we propose a Geometrically-Guided Spatial Understanding Chain Framework (GSUC-VLM) for autonomous driving that leverages pretrained VLMs to jointly exploit semantic and spatial information in multi-view images. Specifically, we first design a dual-encoder architecture to fuse the semantic and spatial features separately extracted from multi-view images with a lightweight connector rather than introducing external spatial modalities. Subsequently, we align semantic and spatial features via distillation loss to generate semantic tokens enriched with the spatial information at the latent layer. Furthermore, we develop a projective feature conditioning method that incorporates camera intrinsic and extrinsic parameters to embed projection matrix encoding into the input vectors and introduce 3D position embeddings into the fusion layer for capturing complex spatial relationship across multiple views in autonomous driving. Experimental results show that the proposed GSUC-VLM achieves state-of-the-art performance in VQA tasks while providing Chain-of-Thought (CoT) understanding. Remarkably, GSUC-VLM demonstrates strong generalization on zero-shot VQA tasks.
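
As background for the projective feature conditioning step, the sketch below shows how camera intrinsic and extrinsic parameters combine into a 3x4 projection matrix under the standard pinhole model, P = K [R | t], and how that matrix maps a 3D point to pixel coordinates. It illustrates only the geometry; how GSUC-VLM encodes the matrix into its input vectors is not shown, and the function names and example calibration values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def projection_matrix(K, R, t):
    """Build P = K [R | t] from 3x3 intrinsics K and extrinsics (R, t).

    Convention assumed here: R and t map world coordinates into the
    camera frame.
    """
    Rt = [list(R[i]) + [t[i]] for i in range(3)]  # 3x4 matrix [R | t]
    return matmul(K, Rt)


def project(P, point):
    """Project a 3D world point to (u, v) pixel coordinates."""
    x, y, z = point
    homogeneous = [x, y, z, 1.0]
    u, v, w = (sum(P[i][j] * homogeneous[j] for j in range(4)) for i in range(3))
    if w <= 0:
        raise ValueError("point is behind the camera")
    return (u / w, v / w)


# Hypothetical calibration: focal length 1000 px, principal point (800, 450),
# camera at the world origin with no rotation.
K = [[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

P = projection_matrix(K, R, t)
pixel = project(P, (1.0, 0.5, 10.0))  # (900.0, 500.0)
```

In a multi-camera rig each camera has its own K, R, and t, so each view has a distinct P; conditioning features on this matrix is what lets a model relate pixels from different views to a shared 3D frame.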

Index terms

Deep Learning for Visual Perception, Computer Vision for Automation