TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-Modal Representation for End-To-End Autonomous Driving
Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Sheng Sun
AI summary
Problem
Existing diffusion-based trajectory generation models suffer from mode collapse and depend on predefined trajectory anchors or scene priors, limiting their ability to generalize to unseen real-world driving scenarios.
Approach
The proposed encoder-decoder model conditions a diffusion denoising process on fused scene and motion features, augmented by a novel multi-modal representation decorrelation mechanism that enriches the latent space during training to prevent trajectory convergence.
Key results
- Achieves state-of-the-art PDMS of 94.9 on the NAVSIM benchmark
- Operates without predefined trajectory anchors or pre-computed scene priors
- Generates more diverse and plausible candidate trajectories in qualitative evaluation
- Introduces a computation-efficient, plug-and-play decorrelation regularization
Why it matters
Provides a scalable, prior-free framework for robust multi-mode trajectory planning, advancing the generalization and safety of end-to-end autonomous driving systems.
Abstract
In recent years, diffusion models have demonstrated remarkable potential across diverse domains, from vision generation to language modeling. Transferring their generative capabilities to modern end-to-end autonomous driving systems has also emerged as a promising direction. However, existing diffusion-based trajectory generative models often exhibit mode collapse, where different random noises converge to similar trajectories after the denoising process. Therefore, state-of-the-art models often rely on anchored trajectories from a predefined trajectory vocabulary or scene priors in the training set to mitigate collapse and enrich the diversity of generated trajectories, but such inductive biases are not available in real-world deployment, which can hinder generalization to unseen scenarios. In this work, we investigate the possibility of effectively tackling the mode collapse challenge without assuming a predefined trajectory vocabulary or pre-computed scene priors. Specifically, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model, where the encoded scene information and motion states serve as the multi-modal conditional input of the denoising decoder. Different from existing approaches, we exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process to enrich the latent representation space, which better guides the downstream generation. Without any predefined trajectory anchors or pre-computed scene priors, TransDiffuser achieves a PDMS of 94.9 on the closed-loop planning-oriented benchmark NAVSIM, surpassing previous state-of-the-art methods. Qualitative evaluation further shows that TransDiffuser generates more diverse and plausible trajectories that explore more of the drivable area.
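The abstract does not spell out the form of the decorrelation objective, so the following is only an illustrative sketch of one common way to decorrelate a batch of latent representations: standardize each latent dimension across the batch and penalize the squared off-diagonal entries of the resulting correlation matrix. The function name `decorrelation_penalty`, the `eps` constant, and the normalization by the number of off-diagonal pairs are assumptions made for this example and are not taken from the paper. It is written in plain Python for readability; in training, the same quantity would be computed in an autodiff framework and added to the denoising loss with a weighting coefficient.

```python
import math
import random


def decorrelation_penalty(features, eps=1e-8):
    """Mean squared off-diagonal entry of the feature correlation matrix.

    features: list of N rows, each a list of D floats (N samples, D >= 2 dims).
    Returns 0.0 when every pair of dimensions is uncorrelated across the batch.
    (Hypothetical helper for illustration; not the paper's exact loss.)
    """
    n = len(features)
    d = len(features[0])
    # Standardize each dimension (column) to zero mean and unit variance.
    cols = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        centered = [x - mean for x in col]
        std = math.sqrt(sum(c * c for c in centered) / n)
        cols.append([c / (std + eps) for c in centered])
    # Accumulate squared correlations over all off-diagonal pairs.
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(a * b for a, b in zip(cols[i], cols[j])) / n
                total += corr * corr
    return total / (d * (d - 1))


# Toy batches: one with two nearly redundant dimensions, one with independent ones.
random.seed(0)
base = [random.gauss(0, 1) for _ in range(256)]
redundant = [[b, 2 * b + 0.01 * random.gauss(0, 1)] for b in base]
independent = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(256)]

penalty_redundant = decorrelation_penalty(redundant)
penalty_independent = decorrelation_penalty(independent)
```

On the toy data, the penalty is close to 1 for the redundant batch and close to 0 for the independent one. Minimizing a term of this kind pushes latent dimensions to carry distinct information, which is the intuition behind using decorrelation to keep different noise samples from collapsing onto similar trajectories.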