Sce2DriveX: A Generalized MLLM Framework for Scene-To-Drive Learning
Zhao Rui, Yuan Qirui, Li Jinyu, Hu Haofeng, Li Yun, Gao Zhenhai, Gao Fei
AI summary
Problem
Current autonomous driving systems struggle to generalize across diverse traffic scenarios and lack interpretable, human-aligned reasoning due to a gap between high-level semantic understanding and low-level motion control.
Approach
The authors propose Sce2DriveX, a multimodal LLM framework that jointly processes multi-view video and Bird's Eye View maps to model long-range spatiotemporal relationships, reconstructing human driving cognition through a chain-of-thought reasoning pipeline and a novel VQA dataset.
Key results
- Achieves state-of-the-art performance on scene understanding and end-to-end driving tasks
- Demonstrates robust cross-scene generalization on the CARLA Bench2Drive benchmark
- Introduces the first comprehensive VQA driving instruction dataset for 3D spatial reasoning
- Enables interpretable, human-consistent driving decisions through chain-of-thought reasoning
Why it matters
Provides a scalable, interpretable foundation for next-generation autonomous driving systems that generalize across complex, real-world environments.
Abstract
End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is a crucial part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) to high-level semantic understanding of traffic scenes, it remains challenging to translate this conceptual understanding into low-level motion control commands and to achieve cross-scene driving generalization and consensus. We propose Sce2DriveX, a human-like chain-of-thought (CoT) driving reasoning MLLM framework, designed to learn progressively across the driving process, from multi-view scene understanding through behavior analysis and motion planning to vehicle control. Sce2DriveX utilizes multimodal joint learning of local scene videos and global Bird's Eye View (BEV) maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its 3D dynamic/static scene perception and reasoning capabilities and achieving cross-scene generalization. Meanwhile, it reconstructs the implicit cognitive chain inherent in human driving, further enhancing the consensus between autonomous driving and human thought. To improve model performance, we construct the first comprehensive Visual Question Answering (VQA) driving instruction dataset, which is tailored for 3D spatial understanding and long-axis task reasoning, and introduce a task-oriented three-stage training pipeline to support supervised fine-tuning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.
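The progressive chain described in the abstract (scene understanding, then behavior analysis, then motion planning, then vehicle control) can be pictured as a sequence of stages in which each stage reads the original observation plus every earlier stage's output. The sketch below is only an illustration of that staging idea under stated assumptions: the names `STAGES`, `CoTTrace`, and `run_chain` are hypothetical, the stage functions are placeholders for model calls, and nothing here reflects the authors' actual implementation, prompts, or model interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Stage names taken from the abstract's description of the driving process.
# The identifiers themselves are hypothetical.
STAGES: List[str] = [
    "scene_understanding",
    "behavior_analysis",
    "motion_planning",
    "vehicle_control",
]


@dataclass
class CoTTrace:
    """Ordered record of (stage name, stage output) pairs."""
    steps: List[Tuple[str, str]] = field(default_factory=list)


def run_chain(observation: str,
              stage_fns: Dict[str, Callable[[str], str]]) -> CoTTrace:
    """Run the stages in order, feeding each one the accumulated context.

    `stage_fns` maps a stage name to a callable that stands in for a model
    call (an assumption for illustration only).
    """
    trace = CoTTrace()
    context = observation
    for name in STAGES:
        output = stage_fns[name](context)
        trace.steps.append((name, output))
        # Later stages see the observation plus all earlier stage outputs.
        context = f"{context}\n[{name}] {output}"
    return trace


if __name__ == "__main__":
    # Placeholder stage functions; a real system would query the MLLM here.
    stubs = {name: (lambda ctx, n=name: f"<{n} output>") for name in STAGES}
    for stage, out in run_chain("multi-view video + BEV map", stubs).steps:
        print(stage, "->", out)
```

Keeping the intermediate outputs in a trace, rather than returning only the final control command, is what makes the reasoning inspectable at each step.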