ICRA 2026

Probing Multimodal LLMs As World Models for Driving

Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus

AI summary

Current multimodal LLMs excel at single-image analysis but fail to reason coherently over sequential driving frames, making them unreliable as dynamic driving world models.

Multimodal LLMs, Autonomous Driving, World Models, Temporal Reasoning, Driving Benchmark, Closed-loop Simulation

Problem

The applicability of Multimodal Large Language Models (MLLMs) as driving world models remains untested in dynamic, closed-loop driving scenarios, leaving a gap in understanding their ability to reason over sequential visual data for autonomous driving tasks.

Approach

The authors introduce the EVAL-LLM-DRIVE dataset and DRIVESIM simulator to evaluate leading MLLMs on real and simulated driving footage, testing their reasoning capabilities across ego-car dynamics, other road actors, trajectory planning, and open-set scene reasoning.

Key results

MLLMs exhibit strong forward-motion bias, with GPT-4o predicting forward movement in 75.8% of cases
Models achieve only ~50% accuracy on ego-car dynamics like acceleration and turning
GPT-4o improves over GPT-4V in detecting other road actors but still fails at trajectory planning
Introduction of the EVAL-LLM-DRIVE dataset and DRIVESIM simulator for standardized benchmarking
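
The two headline numbers above (how often a model predicts forward motion, and accuracy on ego-car dynamics) are simple ratios over per-clip labels. The following is a minimal sketch of how such metrics could be computed, assuming each clip has one categorical ground-truth label and one model prediction. The function names and the toy labels are hypothetical and are not taken from the paper or from EVAL-LLM-DRIVE.

```python
from collections import Counter


def prediction_share(predictions, label):
    """Return the fraction of predictions equal to `label`."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return Counter(predictions)[label] / len(predictions)


def accuracy(predictions, ground_truth):
    """Return the fraction of predictions that match the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not predictions:
        raise ValueError("inputs must be non-empty")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(predictions)


# Toy, made-up labels for illustration only (not data from the paper).
ground_truth = ["forward", "backward", "left", "right"]
predictions = ["forward", "forward", "forward", "right"]

forward_share = prediction_share(predictions, "forward")
acc = accuracy(predictions, ground_truth)

print(f"share of 'forward' predictions: {forward_share:.1%}")
print(f"accuracy: {acc:.1%}")
```

Comparing the share of "forward" predictions against the share of "forward" ground-truth labels is what would separate a bias in the model from an imbalance in the data.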

Why it matters

Identifies critical temporal reasoning gaps in top MLLMs, guiding future research toward reliable, dynamic scene understanding for safe autonomous driving systems.

Abstract

We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the EVAL-LLM-DRIVE dataset and DRIVESIM simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

Index terms

Performance Evaluation and Benchmarking, Data Sets for Robotic Vision, Autonomous Vehicle Navigation