ICRA 2026

Sce2DriveX: A Generalized MLLM Framework for Scene-To-Drive Learning

Zhao Rui, Yuan Qirui, Li Jinyu, Hu Haofeng, Li Yun, Gao Zhenhai, Gao Fei

AI summary

Sce2DriveX bridges high-level scene understanding and low-level vehicle control using a multimodal LLM with human-like chain-of-thought reasoning, achieving state-of-the-art cross-scene driving generalization.

Multimodal LLMs, autonomous driving, chain-of-thought reasoning, visual question answering, end-to-end driving, scene understanding

Problem

Current autonomous driving systems struggle to generalize across diverse traffic scenarios and lack interpretable, human-aligned reasoning due to a gap between high-level semantic understanding and low-level motion control.

Approach

The authors propose Sce2DriveX, a multimodal LLM framework that jointly processes multi-view video and Bird's Eye View maps to model long-range spatiotemporal relationships, reconstructing human driving cognition through a chain-of-thought reasoning pipeline and a novel VQA dataset.

Key results

Achieves state-of-the-art performance on scene understanding and end-to-end driving tasks
Demonstrates robust cross-scene generalization on the CARLA Bench2Drive benchmark
Introduces the first comprehensive VQA driving instruction dataset for 3D spatial reasoning
Enables interpretable, human-consistent driving decisions through chain-of-thought reasoning

Why it matters

Provides a scalable, interpretable foundation for next-generation autonomous driving systems that generalize across complex, real-world environments.

Abstract

End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is a crucial part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) to high-level traffic scene semantic understanding, it remains challenging to effectively translate this conceptual semantic understanding into low-level motion control commands and to achieve cross-scene driving generalization and consensus. We propose Sce2DriveX, a human-like chain-of-thought (CoT) driving reasoning MLLM framework, designed to achieve progressive learning across the driving process, from multi-view scene understanding to behavior analysis, motion planning, and vehicle control. Sce2DriveX utilizes multimodal joint learning of local scene videos and global Bird’s Eye View (BEV) maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its 3D dynamic/static scene perception and reasoning capabilities and achieving cross-scene generalization. Meanwhile, it reconstructs the implicit cognitive chain inherent in human driving, further enhancing the consensus between autonomous driving and human thought. To improve model performance, we construct the first comprehensive Visual Question Answering (VQA) driving instruction dataset, which is tailored for 3D spatial understanding and long-axis task reasoning, and introduce a task-oriented three-stage training pipeline to support supervised fine-tuning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.
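The progressive ordering named in the abstract (scene understanding, behavior analysis, motion planning, vehicle control) can be illustrated with a minimal sketch of a staged question-answer chain, in which each stage's prompt carries the answers of the stages before it. Only the four stage names come from the abstract. The names `CoTTrace`, `build_prompt`, `run_chain`, and `stub_model` are hypothetical, the prompt wording is invented for illustration, and the stub stands in for an MLLM call; this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Stage order follows the abstract; everything else here is a hypothetical illustration.
STAGES = [
    "scene_understanding",
    "behavior_analysis",
    "motion_planning",
    "vehicle_control",
]


@dataclass
class CoTTrace:
    """Answers collected so far, keyed by stage name in insertion order."""
    answers: Dict[str, str] = field(default_factory=dict)


def build_prompt(stage: str, trace: CoTTrace) -> str:
    """Compose a stage question that carries earlier answers as context."""
    context = "\n".join(f"[{name}] {text}" for name, text in trace.answers.items())
    question = f"Q({stage}): describe the {stage.replace('_', ' ')} for the current clip."
    return f"{context}\n{question}" if context else question


def run_chain(model: Callable[[str], str]) -> CoTTrace:
    """Query the model once per stage, feeding each answer into later prompts."""
    trace = CoTTrace()
    for stage in STAGES:
        prompt = build_prompt(stage, trace)
        trace.answers[stage] = model(prompt)
    return trace


def stub_model(prompt: str) -> str:
    """Placeholder for an MLLM: reports how many earlier answers the prompt held."""
    return f"saw {prompt.count('[')} earlier answers"


if __name__ == "__main__":
    for stage, answer in run_chain(stub_model).answers.items():
        print(stage, "->", answer)
```

With the stub, the count of earlier answers grows by one at each stage, which makes the chaining visible without any model. A real system would replace `stub_model` with a call that also receives the video and BEV inputs.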

Index terms

Autonomous Vehicle Navigation, Autonomous Agents, Semantic Scene Understanding