ICRA 2026

V2V-LLM: Vehicle-To-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen

AI summary

Fusing multi-vehicle perception data through a multimodal LLM significantly outperforms traditional fusion baselines for cooperative autonomous driving tasks like grounding, object identification, and planning.

Cooperative autonomous driving; Multimodal large language models; Vehicle-to-vehicle communication; V2V-QA dataset; LiDAR perception fusion; Autonomous planning

Problem

Existing cooperative driving research primarily focuses on perception tasks like detection and tracking, leaving the integration of multi-vehicle perception data with downstream planning and natural language reasoning largely unexplored.

Approach

The authors introduce the V2V-QA dataset and a baseline model that fuses LiDAR-derived scene and object features from multiple connected vehicles into a multimodal LLM to answer driving-related questions about grounding, object identification, and trajectory planning.
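As a rough illustration of the fusion idea described above, the sketch below gathers scene-level and object-level feature tokens from several connected vehicles into one sequence and pairs it with a question. It is not the authors' implementation: the names `CavFeatures` and `build_llm_input`, the token layout, and the toy feature values are all assumptions made for this example, and a real system would pass the resulting tokens to a multimodal LLM rather than return a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CavFeatures:
    """Hypothetical container for one connected vehicle's LiDAR-derived features."""
    cav_id: str
    scene_tokens: List[List[float]]   # scene-level feature vectors (assumed layout)
    object_tokens: List[List[float]]  # object-level feature vectors (assumed layout)


def build_llm_input(cavs: List[CavFeatures], question: str) -> Dict[str, object]:
    """Concatenate every CAV's feature tokens and attach the driving question.

    The token order (scene tokens, then object tokens, vehicle by vehicle) is an
    assumption for illustration only.
    """
    perception_tokens: List[List[float]] = []
    token_sources: List[str] = []
    for cav in cavs:
        for token in cav.scene_tokens + cav.object_tokens:
            perception_tokens.append(token)
            token_sources.append(cav.cav_id)  # remember which vehicle sent the token
    return {
        "perception_tokens": perception_tokens,
        "token_sources": token_sources,
        "question": question,
    }


# Toy usage with made-up two-dimensional features from two vehicles.
ego = CavFeatures("cav_ego", scene_tokens=[[0.1, 0.2]], object_tokens=[[0.3, 0.4]])
other = CavFeatures("cav_1", scene_tokens=[[0.5, 0.6]], object_tokens=[])
llm_input = build_llm_input([ego, other], "Is there an object at the queried location?")
```

Tracking the source vehicle for each token is one possible design choice; it lets a downstream model or a debugging tool tell which vehicle contributed which piece of evidence.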

Key results

Creation of V2V-QA dataset with 1.45M QA pairs for grounding, object identification, and planning
Development of V2V-LLM baseline fusing multi-CAV LiDAR features via a multimodal LLM
V2V-LLM outperforms no-fusion, early-fusion, and intermediate-fusion baselines across all tasks
Establishment of a new benchmark demonstrating LLMs' potential as unified cooperative driving models

Why it matters

This work establishes a new benchmark and research direction for safe, cooperative autonomous driving by demonstrating how multimodal LLMs can unify multi-vehicle perception and planning for real-world deployment.

Abstract

Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multimodal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multimodal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. Our code and dataset are released to facilitate open-source research at https://eddyhkchiu.github.io/v2vllm.github.io/.

Index terms

Computer Vision for Transportation; Intelligent Transportation Systems; Deep Learning for Visual Perception