Structured Labeling Enables Faster Vision-Language Models for End-To-End Autonomous Driving
Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang
AI summary
Problem
Current vision-language models for autonomous driving rely on unstructured text annotations and massive parameter counts (>7B), creating high computational costs and slow inference that hinder real-world deployment.
Approach
The authors introduce NuScenes-S, a structured and concise benchmark dataset, alongside FastDrive, a compact 0.9B parameter VLM that processes structured inputs and generates machine-friendly driving decisions via chain-of-thought reasoning.
Key results
- Introduced NuScenes-S structured benchmark dataset
- Developed FastDrive, a 0.9B parameter VLM baseline
- Achieved ~20% accuracy gain on decision-making tasks
- Delivered >10× inference speedup over a large-parameter baseline
Why it matters
Enables efficient, real-time deployment of reasoning-capable autonomous driving systems by drastically reducing computational overhead without sacrificing decision accuracy.
Abstract
Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge these gaps, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing the massive-parameter baseline in inference speed with over 10× speedup. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance in autonomous driving.
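To make the contrast between loosely formatted descriptions and machine-friendly structured representations concrete, the following is a minimal illustrative sketch. It is not the NuScenes-S schema: the `SceneAnnotation` class, the `decision` field, the label values, and the example sentence are all hypothetical. Only weather and time of day are named in the abstract as scene annotations.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SceneAnnotation:
    # Hypothetical record; only "weather" and "time of day" are
    # mentioned in the abstract as scene annotations.
    weather: str
    time_of_day: str
    decision: str  # hypothetical machine-friendly decision label


def to_structured(annotation: SceneAnnotation) -> str:
    """Serialize an annotation to compact JSON with no extra whitespace."""
    return json.dumps(asdict(annotation), separators=(",", ":"))


# A loosely formatted description of the kind the abstract calls redundant
# (invented here purely for illustration).
free_text = (
    "It is a rainy night, and the ego vehicle should slow down "
    "because the road ahead may be slippery."
)

structured = to_structured(SceneAnnotation("rainy", "night", "decelerate"))

# The structured form can be parsed back without any language understanding.
parsed = json.loads(structured)
print(structured)
print(parsed["decision"])
```

In this sketch a downstream planner could read `parsed["decision"]` directly, whereas the free-text sentence would first need to be interpreted. This is the sense in which a structured label is machine-friendly.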