MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving
Xiaoguang Ren, Wenjing Yang, Long Lan
AI summary
Problem
Existing 3D vehicle generation methods struggle with in-the-wild autonomous driving data due to heavy occlusions, limited viewpoints, and reliance on single-view or neural rendering pipelines that produce low-quality meshes and lack geometric accuracy.
Approach
The method cycles multi-view images as conditioning inputs while using LiDAR point clouds as test-time geometric guidance during the diffusion denoising process, followed by an opacity-based voxel filtering step to remove artifacts and produce clean meshes.
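The two ideas in this approach, cycling views as conditioning and nudging the sample toward LiDAR geometry at test time, can be illustrated with a toy NumPy loop. This is only a sketch under stated assumptions, not the authors' implementation: `toy_denoiser`, `voxelize`, `guided_sampling`, the `strength` parameter, and the dense occupancy grid are all hypothetical stand-ins, and the LiDAR points are assumed to be already aligned to a canonical cube of side 1 centered at the origin.

```python
import numpy as np

def voxelize(points, resolution):
    """Mark voxels of the [-0.5, 0.5)^3 cube that contain at least one point."""
    grid = np.zeros((resolution,) * 3, dtype=bool)
    idx = np.floor((points + 0.5) * resolution).astype(int)
    inside = np.all((idx >= 0) & (idx < resolution), axis=1)
    idx = idx[inside]  # drop points that fall outside the cube
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def toy_denoiser(x, view_feature):
    """Hypothetical stand-in for a pretrained model: pulls x toward the condition."""
    return x + 0.1 * (view_feature - x)

def guided_sampling(view_features, lidar_points, resolution=8,
                    steps=12, strength=0.5, seed=0):
    """Toy sampling loop with cycled conditioning and point-cloud guidance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((resolution,) * 3)  # noisy occupancy scores
    target = voxelize(lidar_points, resolution)
    schedule = []
    for t in range(steps):
        view = t % len(view_features)  # cycle through the available views
        schedule.append(view)
        x = toy_denoiser(x, view_features[view])
        # Guidance (assumed form): raise scores only where LiDAR saw a surface.
        x[target] += strength * (1.0 - x[target])
    return x, schedule, target
```

With three views and seven steps, the conditioning order is 0, 1, 2, 0, 1, 2, 0, so every view influences the sample several times instead of only one view being used throughout.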
Key results
- Zero-shot adaptation of native 3D diffusion priors to multimodal driving data
- Test-time LiDAR point cloud guidance for geometric accuracy and occlusion robustness
- Opacity-based voxel filtering strategy to suppress floaters and refine mesh fidelity
- State-of-the-art performance on the Waymo dataset in novel-view synthesis and geometric accuracy
Why it matters
Enables scalable, high-fidelity 3D vehicle asset creation for autonomous driving simulation and perception training without costly manual modeling or model retraining.
Abstract
Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environments. However, most existing vehicle generation methods fail to fully exploit multimodal sensors (i.e., multi-view images and LiDAR point clouds) and rely on neural-rendering-based reconstruction, leading to low-quality meshes. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and camera data from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on the Waymo dataset demonstrate that our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at https://github.com/HongliXiao/MM-TRELLIS.
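The opacity-based voxel filtering step can be sketched as follows. This is a minimal illustration under assumptions that the abstract does not specify: each Gaussian is assumed to belong to exactly one voxel, a voxel is kept when its most opaque Gaussian reaches a threshold, and the function name, the max-opacity criterion, and the `threshold` value are hypothetical.

```python
import numpy as np

def filter_voxels_by_opacity(voxel_coords, gaussian_voxel_ids, opacities,
                             threshold=0.1):
    """Keep voxels whose most opaque Gaussian reaches `threshold`.

    voxel_coords:       (V, 3) integer voxel coordinates
    gaussian_voxel_ids: (G,) index of the voxel that owns each Gaussian
    opacities:          (G,) opacity of each Gaussian, in [0, 1]
    """
    max_opacity = np.zeros(len(voxel_coords))
    # Unbuffered scatter-max: per-voxel maximum over its Gaussians.
    np.maximum.at(max_opacity, gaussian_voxel_ids, opacities)
    keep = max_opacity >= threshold
    return voxel_coords[keep], keep

coords = np.array([[0, 0, 0], [1, 0, 0], [5, 5, 5]])
owner = np.array([0, 0, 1, 2, 2])
alpha = np.array([0.9, 0.2, 0.6, 0.02, 0.05])
kept_coords, kept_mask = filter_voxels_by_opacity(coords, owner, alpha)
```

In this example the isolated voxel at (5, 5, 5) contains only nearly transparent Gaussians, so it is treated as a floater and removed before mesh extraction. A mean-opacity criterion would be an equally plausible choice.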