ICRA 2026

Self-Supervised Street Gaussians for Autonomous Driving

Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

AI summary

S3Gaussian achieves state-of-the-art photorealistic 3D reconstruction and novel view synthesis for dynamic street scenes without requiring costly 3D bounding box annotations.

3D Gaussian Splatting, Self-Supervised Learning, Dynamic Scene Reconstruction, Autonomous Driving Simulation, Novel View Synthesis, Spatial-Temporal Modeling

Problem

Existing street scene reconstruction methods rely on tracked 3D vehicle bounding boxes to separate static and dynamic elements. Because these annotations are costly to obtain, the methods are hard to apply to real-world, in-the-wild autonomous driving scenarios.

Approach

The authors propose S3Gaussian, a self-supervised method that uses a multi-resolution Hexplane spatial-temporal field network to decompose 4D scene dynamics and predict Gaussian deformations, enabling automatic separation of static and dynamic components without explicit supervision.
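
The idea of a multi-resolution Hexplane field that predicts Gaussian deformations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes six axis-aligned feature planes (xy, xz, yz, xt, yt, zt) per resolution, bilinear interpolation on each plane, element-wise multiplication across planes, concatenation across resolutions, and a single linear head that outputs a position offset. All names (`make_hexplane`, `hexplane_features`, `deform_positions`, `head_weights`) are hypothetical, and the random planes stand in for learned parameters.

```python
import numpy as np

# Index pairs into (x, y, z, t): three spatial planes and three space-time planes.
PLANE_AXES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]


def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (H, W, C) grid at coordinates u, v in [0, 1]."""
    h, w, _ = plane.shape
    x = np.clip(u, 0.0, 1.0) * (h - 1)
    y = np.clip(v, 0.0, 1.0) * (w - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, h - 1)
    y1 = np.minimum(y0 + 1, w - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = plane[x0, y0] * (1 - wy) + plane[x0, y1] * wy
    bottom = plane[x1, y0] * (1 - wy) + plane[x1, y1] * wy
    return top * (1 - wx) + bottom * wx


def make_hexplane(resolutions, channels, rng):
    """Assumed layout: one list of six (R, R, C) planes per resolution level."""
    return [[rng.normal(size=(r, r, channels)) for _ in PLANE_AXES]
            for r in resolutions]


def hexplane_features(planes, xyzt):
    """Return (N, C * num_levels) features for normalized points xyzt of shape (N, 4)."""
    per_level = []
    for level in planes:
        feat = np.ones((xyzt.shape[0], level[0].shape[-1]))
        for plane, (a, b) in zip(level, PLANE_AXES):
            # Assumption: plane features are fused by element-wise product.
            feat = feat * bilinear_sample(plane, xyzt[:, a], xyzt[:, b])
        per_level.append(feat)
    return np.concatenate(per_level, axis=1)


def deform_positions(means, t, planes, head_weights):
    """Shift Gaussian centers (N, 3) by an offset predicted from the field at time t."""
    xyzt = np.concatenate([means, np.full((means.shape[0], 1), t)], axis=1)
    return means + hexplane_features(planes, xyzt) @ head_weights


rng = np.random.default_rng(0)
planes = make_hexplane(resolutions=[16, 32], channels=8, rng=rng)
means = rng.uniform(size=(100, 3))                 # Gaussian centers in [0, 1]^3
head_weights = 0.01 * rng.normal(size=(16, 3))     # 2 levels x 8 channels -> xyz offset
moved = deform_positions(means, t=0.5, planes=planes, head_weights=head_weights)
```

In a sketch like this, a Gaussian whose predicted offset stays near zero for every `t` behaves as static, while one whose offset varies with `t` behaves as dynamic, which is one way to read the separation of static and dynamic components described above.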

Key results

State-of-the-art PSNR and SSIM scores on Waymo-NOTR and Waymo-Street datasets
Fully self-supervised decomposition of static and dynamic scene elements
Superior rendering quality and temporal consistency for fast-moving and distant objects
Efficient training requiring only ~10GB GPU memory while maintaining real-time rendering
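
For readers unfamiliar with the first metric listed above, PSNR is derived from the mean squared error between a rendered image and its reference: PSNR = 10 · log10(MAX² / MSE). The sketch below uses the standard definition and assumes images are float arrays scaled to [0, 1]. The function name `psnr` and the `max_value` parameter are illustrative and do not come from the paper's evaluation code.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A uniform error of 0.1 gives MSE = 0.01, so PSNR = 10 * log10(1 / 0.01) = 20 dB.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
score = psnr(rendered, reference)
```

Higher values indicate a rendering closer to the reference. SSIM, the second metric, compares local structure rather than per-pixel error and is not covered by this sketch.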

Why it matters

Enables scalable, annotation-free 3D street scene reconstruction for high-fidelity, real-world simulators crucial for advancing end-to-end autonomous driving systems.

Abstract

Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian (S3Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our S3Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations.

Affiliations: (1) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; (2) UC Berkeley; (3) Tsinghua University. * Equal contribution. † Project leader. Corresponding author: shanghang@pku.edu.cn

Index terms

Computer Vision for Transportation, Intelligent Transportation Systems