ICRA 2026

GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-Training in Autonomous Driving

Shaoqing Xu, Fang LI, Shengyin Jiang, Ziying Song, Zhi-Xin Yang

AI summary

GaussianPretrain leverages 3D Gaussian Splatting for self-supervised pre-training, outperforming NeRF-based methods in 3D perception tasks while training 40.6% faster and using only 70% of the GPU memory.

3D Gaussian Splatting; Self-supervised Learning; Autonomous Driving; Visual Pre-training; 3D Perception; Scene Reconstruction

Problem

Existing self-supervised pre-training methods for autonomous driving struggle to jointly capture geometric and texture information efficiently, often neglecting one aspect or incurring high computational costs.

Approach

The authors introduce a framework that uses learnable 3D Gaussian anchors guided by LiDAR depth to reconstruct RGB, depth, and occupancy from masked multi-view images, unifying geometric and texture learning in a single efficient representation.
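
Two generic building blocks behind this kind of pipeline can be sketched as follows: lifting a pixel with a known LiDAR depth to a 3D anchor center, and front-to-back alpha compositing of per-Gaussian values into a rendered RGB channel or depth. This is a minimal illustration, not the authors' implementation. It assumes an undistorted pinhole camera, and the names `unproject_pixel`, `composite`, `fx`, `fy`, `cx`, and `cy` are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model without lens distortion (an assumption made
    for this sketch, not something stated in the paper summary).
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def composite(values: Sequence[float], alphas: Sequence[float]) -> float:
    """Front-to-back alpha compositing along one ray.

    `values` holds one scalar per Gaussian (for example a depth or a single
    color channel), sorted from nearest to farthest; `alphas` holds the
    matching opacities in [0, 1].
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    out = 0.0
    for value, alpha in zip(values, alphas):
        out += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    return out


# A pixel at the principal point maps onto the optical axis.
anchor = unproject_pixel(320.0, 240.0, 10.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# Two half-opaque Gaussians at depths 10 and 20 along the same ray.
rendered_depth = composite([10.0, 20.0], [0.5, 0.5])
```

In a pre-training setting, the rendered values would be compared against the original image and the LiDAR depth as reconstruction targets. The loss terms and the Gaussian parameterization are left out here because the summary does not specify them.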

Key results

First framework to apply 3D Gaussian Splatting for autonomous driving pre-training
7.05% NDS and 0.97% mAP gains in 3D object detection over prior SOTA
1.9% mAP improvement in HD map construction and 0.8% mIoU in occupancy prediction
40.6% faster training while using only 70% of the GPU memory, compared with the NeRF-based baseline UniPAD

Why it matters

It provides a highly efficient and accurate self-supervised pre-training paradigm for autonomous driving perception, enabling better scene understanding with lower computational costs for researchers and industry developers.

Abstract

Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture, or treat the two aspects separately, hindering comprehensive scene understanding. In this context, we introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, our method learns a deeper understanding of scenes, enhancing pre-training performance with detailed spatial structure and texture, while running 40.6% faster than the NeRF-based method UniPAD and using only 70% of its GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements: a 7.05% increase in NDS for 3D object detection, a 1.9% mAP gain in HD map construction, and a 0.8% improvement in occupancy prediction. These gains highlight GaussianPretrain’s theoretical innovation and strong practical potential, promoting visual pre-training development for autonomous driving. The source code is available at https://github.com/Public-BOTs/GaussianPretrain

Index terms

Computer Vision for Transportation; Object Detection, Segmentation and Categorization; Automation Technologies for Smart Cities