ICRA 2026

ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving Via Multi-Modal Joint Learning

Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang

AI summary

ADGaussian achieves state-of-the-art street scene reconstruction and novel-view synthesis by jointly optimizing visual and geometric features using sparse LiDAR depth and monocular images.

Gaussian Splatting, Autonomous Driving, Multi-modal Learning, LiDAR Depth, Novel View Synthesis, Generalizable Reconstruction

Problem

Existing generalizable Gaussian Splatting methods for autonomous driving struggle with spatial misalignment, inconsistent depth accuracy from pre-trained models, and poor generalization across diverse urban scenarios without per-scene fine-tuning.

Approach

The method integrates sparse LiDAR depth as a second modality alongside monocular images, using a Multi-modal Feature Matching strategy with Depth-guided Positional Embedding and a Multi-scale Gaussian Decoding model to jointly optimize appearance and geometry for 3D Gaussian prediction.
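To make the two inputs concrete, here is a minimal NumPy sketch of how a sparse LiDAR depth map could be rasterized into the image plane and then turned into a per-pixel depth embedding. It is an illustration under stated assumptions, not the paper's implementation: the function names, the pinhole camera model, the sinusoidal form of the embedding, and the `num_freqs` and `max_depth` parameters are all hypothetical, and the paper's actual Depth-guided Positional Embedding may be defined differently.

```python
import numpy as np


def project_lidar_to_sparse_depth(points_cam, K, height, width):
    """Rasterize LiDAR points (already in the camera frame) into a sparse depth map.

    points_cam: (N, 3) array of x, y, z in camera coordinates (z points forward).
    K: (3, 3) pinhole intrinsics (assumed camera model).
    Pixels that receive no LiDAR return stay at 0.
    """
    depth = np.zeros((height, width), dtype=np.float32)
    pts = points_cam[points_cam[:, 2] > 0]           # keep points in front of the camera
    uvw = pts @ K.T                                   # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts[:, 2]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    # Write far points first so the nearest return wins when pixels collide.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth


def depth_positional_embedding(depth, num_freqs=4, max_depth=80.0):
    """Hypothetical sinusoidal embedding of per-pixel depth.

    Returns an (H, W, 2 * num_freqs) array; pixels without depth get zeros.
    """
    d = np.clip(depth / max_depth, 0.0, 1.0)[..., None]        # (H, W, 1), normalized
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi               # (num_freqs,)
    emb = np.concatenate([np.sin(d * freqs), np.cos(d * freqs)], axis=-1)
    valid = (depth > 0)[..., None]                              # mask out empty pixels
    return emb * valid
```

In a setup like this, the embedding would typically be added to or concatenated with image features before feature matching; how ADGaussian fuses the two modalities is described in the paper itself.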

Key results

State-of-the-art rendering and geometric accuracy on Waymo and KITTI benchmarks
Robust zero-shot generalization under extreme viewpoint shifts
Effective resolution of spatial misalignment in multi-modal fusion
Validated contributions of Depth-guided Positional Embedding and multi-scale decoding
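
For readers unfamiliar with how rendering and geometric accuracy are usually scored, the sketch below shows two standard metrics: PSNR for rendered images and absolute relative error for depth. This summary does not name the metrics used, so treating these two as representative is an assumption; the function names and the convention that a ground-truth depth of 0 means "no measurement" are likewise illustrative.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_value]."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def abs_rel_depth_error(pred_depth, gt_depth):
    """Mean |pred - gt| / gt over pixels that have ground-truth depth (gt > 0)."""
    pred_depth = np.asarray(pred_depth, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    valid = gt_depth > 0                         # assumed convention: 0 means no measurement
    if not np.any(valid):
        raise ValueError("ground-truth depth has no valid pixels")
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))
```

Higher PSNR and lower absolute relative error are better; with sparse LiDAR as ground truth, only pixels that have a return contribute to the depth score.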

Why it matters

Provides a practical, pose-aware framework for high-fidelity 3D street scene reconstruction, advancing real-time perception and simulation for autonomous driving systems.

Abstract

We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from merely single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric cues. Furthermore, we propose a Multi-modal Feature Matching strategy coupled with a Multi-scale Gaussian Decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on Waymo and KITTI demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

Index terms

RGB-D Perception, Computer Vision for Transportation, Deep Learning for Visual Perception