ICRA 2026

VLM-E2E: Enhancing End-To-End Autonomous Driving with Multimodal Driver Attention Fusion

Pei Liu, Haipeng Liu, Haichao Liu, Xin LIU, Jinxin Ni, Jun Ma

AI summary

Fusing VLM-derived driver attention cues with BEV features via dynamic weighting significantly improves perception, prediction, and planning in complex driving scenarios.

End-to-end driving, Vision-Language Models, BEV representation, Multimodal fusion, Driver attention, Semantic reasoning

Problem

Current end-to-end autonomous driving models lose critical high-level semantics during 2D-to-3D conversion and fail to replicate the human driver's attentional focus, limiting performance in dynamic and ambiguous scenarios.

Approach

The method leverages a Vision-Language Model to generate and refine textual descriptions of driver attention from front-view images, then integrates these semantics with Bird's-Eye-View features using a learnable weighted fusion strategy.
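
The learnable weighted fusion described above can be illustrated with a minimal sketch. It assumes the simplest possible form: two scalar logits, normalized by a softmax, that weight same-sized BEV and text feature vectors. The names `softmax`, `fuse_bev_text`, and `logits` are hypothetical, and the paper's actual fusion module may differ (for example, per-channel weights or a learned projection before fusion).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_bev_text(bev_feat, text_feat, logits):
    """Weighted sum of BEV and text features (illustrative assumption).

    `logits` holds two values that would be trained by backpropagation
    in a real model; here they are plain floats. The softmax keeps the
    two modality weights positive and summing to one.
    """
    if len(bev_feat) != len(text_feat):
        raise ValueError("BEV and text features must have the same length")
    w_bev, w_text = softmax(logits)
    return [w_bev * b + w_text * t for b, t in zip(bev_feat, text_feat)]


# Equal logits give equal weights, so the result is the element-wise mean.
fused = fuse_bev_text([1.0, 2.0], [3.0, 4.0], logits=[0.0, 0.0])
```

In this sketch, raising the first logit relative to the second shifts the fused representation toward the BEV features, which is one simple way to realize a dynamic balance between modalities.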

Key results

Outperforms baseline E2E models in perception and prediction metrics on the nuScenes dataset
Dynamically balances visual and textual modality contributions through learnable fusion weights
Mitigates VLM hallucination by refining text annotations with ground truth and maneuvering data
Enhances trajectory planning robustness by explicitly modeling driver attentional semantics

Why it matters

Enables autonomous systems to incorporate human-like situational awareness, advancing safety and reliability in complex real-world driving environments.

Abstract

Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. This loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird’s-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver’s attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.

Index terms

Sensor Fusion, Deep Learning for Visual Perception, Computer Vision for Automation