ICRA 2026

ImagiDrive: A Unified Imagination-And-Planning Framework for Autonomous Driving

Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang

AI summary

Integrating a Vision-Language Model with a Driving World Model in a recurrent loop significantly improves planning accuracy and reduces collision rates in autonomous driving.

Autonomous driving, Vision-Language Models, Driving World Models, Imagination-and-Planning, Trajectory Prediction, Closed-loop evaluation

Problem

Current autonomous driving methods lack holistic scene understanding and causal reasoning. Combining Vision-Language Models (VLMs) for planning with Driving World Models (DWMs) for scene generation is a natural remedy, but it remains difficult: action-level decisions must be connected to high-fidelity pixel-level predictions, and the combined system must stay computationally efficient.

Approach

ImagiDrive uses a VLM agent to predict initial trajectories that guide a world model to generate future scenes; these imagined frames are iteratively fed back to refine planning, with early stopping and trajectory selection ensuring efficiency and safety.
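
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`imagine_and_plan`, `toy_plan`, `toy_imagine`), the mean-displacement convergence test, and the values of `max_iters` and `tol` are all assumptions made for the example, since the summary does not specify how convergence is measured.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in the ego frame
Frames = List[object]                   # placeholder type for imagined future frames


def mean_displacement(a: Trajectory, b: Trajectory) -> float:
    """Average Euclidean distance between matching waypoints."""
    return sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


def imagine_and_plan(
    observations: object,
    plan: Callable[[object, Frames], Trajectory],      # stands in for the VLM agent
    imagine: Callable[[object, Trajectory], Frames],   # stands in for the world model
    max_iters: int = 5,
    tol: float = 0.2,
) -> List[Trajectory]:
    """Run the loop and return every trajectory produced (the buffer)."""
    # Initial plan from real observations only (no imagined frames yet).
    buffer = [plan(observations, [])]
    for _ in range(max_iters):
        imagined = imagine(observations, buffer[-1])
        buffer.append(plan(observations, imagined))
        # Assumed convergence test: stop once consecutive plans barely differ.
        if mean_displacement(buffer[-2], buffer[-1]) < tol:
            break
    return buffer


# Toy stand-ins so the sketch runs; they are not real models.
def toy_imagine(observations, trajectory):
    return [trajectory]  # the "frame" simply echoes the conditioning trajectory


def toy_plan(observations, imagined):
    if not imagined:
        return [(1.0, 0.0), (2.0, 0.0)]
    # Nudge each waypoint halfway toward the lateral offset y = 1.0.
    return [(x, y + 0.5 * (1.0 - y)) for x, y in imagined[0]]


buffer = imagine_and_plan(None, toy_plan, toy_imagine)
```

With these toy stand-ins the loop stops after three refinement rounds, leaving four trajectories in the buffer. In a real system each round would cost one world-model generation and one VLM forward pass, which is why stopping early matters for efficiency.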

Key results

Proposes a recurrent imagination-and-planning framework coupling VLM agents with DWM imaginers
Introduces a trajectory buffer with convergence-based early stopping and directional consistency selection
Achieves state-of-the-art planning scores and significantly lower collision rates on NeuroNCAP, nuScenes, and NAVSIM
Demonstrates robust closed-loop and open-loop performance across LLaVA and InternVL backbones
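
One plausible reading of "directional consistency selection" is sketched below. The summary does not define the criterion, so everything here is an assumption: the hypothetical helper `select_by_directional_consistency` scores each buffered trajectory by the mean cosine similarity between its overall heading and the headings of the other candidates, then returns the highest-scoring one.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in the ego frame


def heading(traj: Trajectory) -> Tuple[float, float]:
    """Unit vector from the first waypoint to the last (zero if they coincide)."""
    dx = traj[-1][0] - traj[0][0]
    dy = traj[-1][1] - traj[0][1]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm) if norm > 0 else (0.0, 0.0)


def select_by_directional_consistency(buffer: List[Trajectory]) -> Trajectory:
    """Pick the trajectory whose heading agrees best with the rest of the buffer.

    Assumed criterion (not taken from the paper): mean cosine similarity
    between one candidate's heading and every other candidate's heading.
    """
    headings = [heading(t) for t in buffer]

    def score(i: int) -> float:
        hx, hy = headings[i]
        others = [h for j, h in enumerate(headings) if j != i]
        if not others:
            return 0.0
        return sum(hx * ox + hy * oy for ox, oy in others) / len(others)

    best = max(range(len(buffer)), key=score)
    return buffer[best]


# Two roughly forward candidates and one sharp-left outlier.
forward = [(0.0, 0.0), (10.0, 0.0)]
slight_left = [(0.0, 0.0), (10.0, 1.0)]
sharp_left = [(0.0, 0.0), (0.0, 10.0)]
chosen = select_by_directional_consistency([forward, slight_left, sharp_left])
```

In this toy buffer the sharp-left outlier scores lowest because it disagrees with both other candidates, so the selection falls on one of the forward-leaning trajectories.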

Why it matters

It provides a practical, computationally efficient pathway to merge cognitive reasoning with predictive simulation, advancing safer autonomous driving for researchers and industry developers.

Abstract

Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent’s planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

Index terms

Autonomous Vehicle Navigation, Computer Vision for Automation, Integrated Planning and Learning