ICRA 2026

MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving

Xiaoguang Ren, Wenjing Yang, Long Lan

AI summary

MM-TRELLIS leverages LiDAR point clouds as test-time guidance during diffusion denoising to generate geometrically accurate, high-fidelity 3D vehicle models from multi-view driving images, outperforming existing methods.

3D vehicle generation, LiDAR guidance, Multi-modal diffusion, Autonomous driving, Mesh refinement, Test-time optimization

Problem

Existing 3D vehicle generation methods struggle with in-the-wild autonomous driving data due to heavy occlusions, limited viewpoints, and reliance on single-view or neural rendering pipelines that produce low-quality meshes and lack geometric accuracy.

Approach

The method cycles multi-view images as conditioning inputs while using LiDAR point clouds as test-time geometric guidance during the diffusion denoising process, followed by an opacity-based voxel filtering step to remove artifacts and produce clean meshes.
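The interplay of view cycling and test-time guidance can be illustrated with a toy sampling loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the state is a plain set of 3D points rather than a latent of a 3D generative model, `denoise_step` is a hypothetical stand-in for one conditioned denoising update, the guidance term is the gradient of a one-sided squared Chamfer distance from LiDAR points to generated points, and `guidance_scale` and `num_steps` are illustrative values.

```python
import numpy as np
from itertools import cycle


def chamfer_guidance_grad(x, lidar):
    """Gradient w.r.t. x of the mean squared distance from each LiDAR
    point to its nearest generated point (one-sided Chamfer term)."""
    # Pairwise squared distances, shape (num_lidar, num_generated).
    d2 = ((lidar[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)  # nearest generated point per LiDAR point
    grad = np.zeros_like(x)
    # d/dx_j ||l_i - x_j||^2 = 2 (x_j - l_i), accumulated per generated point.
    np.add.at(grad, nn, 2.0 * (x[nn] - lidar))
    return grad / len(lidar)


def guided_sampling(x, views, lidar, denoise_step,
                    num_steps=20, guidance_scale=0.5):
    """Toy loop: one view conditions each step, then a guidance step
    pulls the current geometry toward the LiDAR points."""
    view_iter = cycle(views)  # views are reused in round-robin order
    for t in range(num_steps):
        cond = next(view_iter)
        x = denoise_step(x, cond, t)  # hypothetical model update
        x = x - guidance_scale * chamfer_guidance_grad(x, lidar)
    return x
```

Because the guidance is applied only at sampling time, a loop of this shape leaves the model weights untouched, which is consistent with the summary's point that no retraining is needed.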

Key results

Zero-shot adaptation of native 3D diffusion priors to multimodal driving data
Test-time LiDAR point cloud guidance for geometric accuracy and occlusion robustness
Opacity-based voxel filtering strategy to suppress floaters and refine mesh fidelity
State-of-the-art performance on the Waymo dataset in novel-view synthesis and geometric accuracy

Why it matters

Enables scalable, high-fidelity 3D vehicle asset creation for autonomous driving simulation and perception training without costly manual modeling or model retraining.

Abstract

Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environments. However, most existing vehicle generation methods fail to fully exploit multimodal sensors (i.e., multi-view images and LiDAR point clouds) and rely on neural-rendering-based reconstruction, leading to low-quality meshes. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and image sensors from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on the Waymo dataset demonstrate that our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at https://github.com/HongliXiao/MM-TRELLIS.
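The opacity-based voxel filtering step can be pictured with a short sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its code: each occupied voxel is assumed to own one or more Gaussians, a voxel is scored by the maximum opacity among its Gaussians, and the names `filter_voxels_by_opacity`, `gaussian_voxel_ids`, and the `threshold` of 0.1 are hypothetical.

```python
import numpy as np


def filter_voxels_by_opacity(voxel_coords, gaussian_voxel_ids,
                             gaussian_opacities, threshold=0.1):
    """Keep voxels whose most opaque Gaussian reaches `threshold`.

    voxel_coords:       (V, 3) integer voxel indices
    gaussian_voxel_ids: (G,) index of the voxel owning each Gaussian
    gaussian_opacities: (G,) opacity of each Gaussian, in [0, 1]
    """
    max_opacity = np.zeros(len(voxel_coords))
    # Unbuffered maximum: handles several Gaussians mapping to one voxel.
    np.maximum.at(max_opacity, gaussian_voxel_ids, gaussian_opacities)
    keep = max_opacity >= threshold
    return voxel_coords[keep], keep


# Usage: the middle voxel holds only near-transparent Gaussians
# (a likely floater) and is dropped.
coords = np.array([[0, 0, 0], [5, 5, 5], [1, 0, 0]])
ids = np.array([0, 0, 1, 1, 2])
opacities = np.array([0.9, 0.2, 0.05, 0.02, 0.5])
kept_coords, keep_mask = filter_voxels_by_opacity(coords, ids, opacities)
```

Scoring by the maximum rather than the mean keeps a voxel as long as any one of its Gaussians contributes visibly; other aggregation rules would be equally plausible under these assumptions.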

Index terms

Computer Vision for Automation, Representation Learning, Sensor Fusion