ICRA 2026

TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-Modal Representation for End-To-End Autonomous Driving

Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Sheng Sun

AI summary

TransDiffuser mitigates mode collapse in diffusion-based trajectory planning by decorrelating multi-modal representations. It reaches a PDMS of 94.9 on NAVSIM and generates more diverse trajectories without relying on predefined anchors or scene priors.

End-to-end autonomous driving; Diffusion models; Trajectory generation; Mode collapse; Multi-modal representation; NAVSIM benchmark

Problem

Existing diffusion-based trajectory generation models suffer from mode collapse and depend on predefined trajectory anchors or scene priors, limiting their ability to generalize to unseen real-world driving scenarios.

Approach

The proposed encoder-decoder model conditions a diffusion denoising process on fused scene and motion features, augmented by a novel multi-modal representation decorrelation mechanism that enriches the latent space during training to prevent trajectory convergence.
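
The summary does not spell out the decorrelation objective, so the following is only a minimal sketch of one common form such a regularizer can take: penalizing the off-diagonal entries of the correlation matrix computed over a batch of representation vectors. The function name `decorrelation_penalty` and the exact formula are assumptions made for illustration, not the authors' implementation.

```python
import math


def decorrelation_penalty(features):
    """Mean squared off-diagonal entry of the feature correlation matrix.

    `features` is a batch of representation vectors (a list of equal-length
    lists of floats). Hypothetical regularizer, not the paper's exact loss.
    """
    n, d = len(features), len(features[0])

    # Center each feature dimension across the batch and scale it to unit norm.
    cols = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        centered = [x - mean for x in col]
        norm = math.sqrt(sum(c * c for c in centered)) or 1.0  # guard constant dims
        cols.append([c / norm for c in centered])

    # The correlation of dimensions i and j is the dot product of their unit columns.
    total, count = 0.0, 0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(a * b for a, b in zip(cols[i], cols[j]))
                total += corr * corr
                count += 1
    return total / count if count else 0.0


# Two perfectly correlated dimensions give the maximum penalty,
# two uncorrelated dimensions give none.
redundant = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
independent = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
penalty_redundant = decorrelation_penalty(redundant)
penalty_independent = decorrelation_penalty(independent)
```

In a training loop, a term of this kind would typically be added to the main denoising loss with a small weight, pushing the representation dimensions to carry less redundant information.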

Key results

Achieves state-of-the-art PDMS of 94.9 on the NAVSIM benchmark
Operates without predefined trajectory anchors or pre-computed scene priors
Generates more diverse and plausible candidate trajectories, as shown by qualitative evaluation
Introduces a computation-efficient, plug-and-play decorrelation regularization
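
One simple way to put a number on trajectory diversity, and so to spot mode collapse, is the mean pairwise distance between the candidate trajectories sampled for a single scene. The sketch below is an illustrative assumption, not the metric used in the paper (the abstract reports only a qualitative diversity evaluation); `mean_pairwise_distance` is a hypothetical helper, and it expects trajectories with the same number of waypoints.

```python
import math
from itertools import combinations


def mean_pairwise_distance(trajectories):
    """Average waypoint-wise L2 distance over all pairs of trajectories.

    Each trajectory is a list of (x, y) waypoints of equal length.
    A value near zero means the samples have collapsed to one mode.
    """
    def dist(a, b):
        return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Identical samples (collapsed) versus samples that end 2 m apart.
collapsed = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 0.0)]]
diverse = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 2.0)]]
score_collapsed = mean_pairwise_distance(collapsed)
score_diverse = mean_pairwise_distance(diverse)
```

A score like this rewards spread alone, so it would need to be read alongside a plausibility or driving-quality measure such as PDMS.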

Why it matters

Provides a scalable, prior-free framework for robust multi-mode trajectory planning, advancing the generalization and safety of end-to-end autonomous driving systems.

Abstract

In recent years, diffusion models have demonstrated remarkable potential across diverse domains, from vision generation to language modeling. Transferring their generative capabilities to modern end-to-end autonomous driving systems has also emerged as a promising direction. However, existing diffusion-based trajectory generative models often exhibit mode collapse, where different random noises converge to similar trajectories after the denoising process. Therefore, state-of-the-art models often rely on anchored trajectories from a predefined trajectory vocabulary or scene priors in the training set to mitigate collapse and enrich the diversity of generated trajectories, but such inductive biases are not available in real-world deployment, which makes generalizing to unseen scenarios challenging. In this work, we investigate the possibility of effectively tackling the mode collapse challenge without the assumption of a predefined trajectory vocabulary or pre-computed scene priors. Specifically, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model, where the encoded scene information and motion states serve as the multi-modal conditional input of the denoising decoder. Different from existing approaches, we exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process to enrich the latent representation space, which better guides the downstream generation. Without any predefined trajectory anchors or pre-computed scene priors, TransDiffuser achieves a PDMS of 94.9 on the closed-loop planning-oriented benchmark NAVSIM, surpassing previous state-of-the-art methods. Qualitative evaluation further shows that TransDiffuser generates more diverse and plausible trajectories that explore more of the drivable area.

Index terms

Integrated Planning and Learning; Task and Motion Planning; Intelligent Transportation Systems