ICRA 2026

Structured Labeling Enables Faster Vision-Language Models for End-To-End Autonomous Driving

Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang

AI summary

Structured data and a compact 0.9B parameter VLM enable competitive autonomous driving performance with over 10× faster inference than massive models.

Vision-Language Models, Autonomous Driving, Structured Data, Compact Models, End-to-End Driving, Inference Efficiency

Problem

Current vision-language models for autonomous driving rely on unstructured text annotations and massive parameter counts (>7B), creating high computational costs and slow inference that hinder real-world deployment.

Approach

The authors introduce NuScenes-S, a structured and concise benchmark dataset, alongside FastDrive, a compact 0.9B parameter VLM that processes structured inputs and generates machine-friendly driving decisions via chain-of-thought reasoning.
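
To illustrate what "structured" and "machine-friendly" can mean in practice, here is a minimal Python sketch. It is not the NuScenes-S schema or FastDrive's output format: the class names, the field names, and the value vocabularies (`Weather`, `TimeOfDay`, `Action`, `SceneLabel`, `parse_label`) are all hypothetical, chosen only because the abstract mentions weather and time of day as scene annotations. The idea shown is that a closed vocabulary lets a downstream planner parse and validate a label directly, whereas a free-text description cannot be checked this way.

```python
import json
from dataclasses import dataclass
from enum import Enum


# Hypothetical closed vocabularies; the real dataset's categories may differ.
class Weather(Enum):
    SUNNY = "sunny"
    RAINY = "rainy"


class TimeOfDay(Enum):
    DAY = "day"
    NIGHT = "night"


class Action(Enum):
    KEEP_LANE = "keep_lane"
    SLOW_DOWN = "slow_down"
    STOP = "stop"


@dataclass(frozen=True)
class SceneLabel:
    """One structured annotation: scene attributes plus a driving decision."""
    weather: Weather
    time_of_day: TimeOfDay
    action: Action

    def to_json(self) -> str:
        # Compact separators keep the serialized label short.
        return json.dumps(
            {
                "weather": self.weather.value,
                "time_of_day": self.time_of_day.value,
                "action": self.action.value,
            },
            separators=(",", ":"),
        )


def parse_label(text: str) -> SceneLabel:
    """Parse and validate a label; raises ValueError on anything off-schema."""
    raw = json.loads(text)  # JSONDecodeError is a subclass of ValueError
    return SceneLabel(
        weather=Weather(raw["weather"]),
        time_of_day=TimeOfDay(raw["time_of_day"]),
        action=Action(raw["action"]),
    )
```

A label built as `SceneLabel(Weather.RAINY, TimeOfDay.NIGHT, Action.SLOW_DOWN)` round-trips through `to_json()` and `parse_label()`, while a sentence such as "It is raining at night, so the car should slow down." is rejected with a `ValueError` instead of being silently misread.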

Key results

Introduced NuScenes-S structured benchmark dataset
Developed FastDrive, a 0.9B parameter VLM baseline
Achieved ~20% accuracy gain on decision-making tasks
Delivered >10× inference speedup over massive baselines

Why it matters

Enables efficient, real-time deployment of reasoning-capable autonomous driving systems by drastically reducing computational overhead without sacrificing decision accuracy.

Abstract

Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing the massive-parameter baseline in inference speed with over 10× speedup. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance for autonomous driving.

Index terms

Computer Vision for Transportation