Self-Supervised Street Gaussians for Autonomous Driving
Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang
AI summary
Problem
Existing street scene reconstruction methods rely on tracked 3D vehicle bounding boxes to separate static and dynamic elements, limiting their application to real-world, in-the-wild autonomous driving scenarios due to high annotation costs.
Approach
The authors propose S3Gaussian, a self-supervised method that uses a multi-resolution Hexplane spatial-temporal field network to decompose 4D scene dynamics and predict Gaussian deformations, enabling automatic separation of static and dynamic components without explicit supervision.
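As a rough illustration of what a multi-resolution Hexplane lookup can look like, the sketch below factorizes a 4D (x, y, z, t) field into six 2D feature planes per resolution level and queries it with bilinear interpolation. It is not the authors' implementation. The function names, the random initialization, the use of the same resolution for the time axis as for space, and the choice to multiply plane features within a level and concatenate across levels are all assumptions made for illustration.

```python
import numpy as np

# Axis pairs for the six planes of a 4D (x, y, z, t) grid:
# three spatial planes (xy, xz, yz) and three space-time planes (xt, yt, zt).
PLANE_AXES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]


def make_hexplane(resolutions, feat_dim, seed=0):
    """Create six (res, res, feat_dim) planes per level (hypothetical random init)."""
    rng = np.random.default_rng(seed)
    return [
        [rng.uniform(0.5, 1.5, size=(res, res, feat_dim)) for _ in PLANE_AXES]
        for res in resolutions  # each res must be >= 2
    ]


def bilinear(plane, uv):
    """Bilinearly sample a (R, R, C) plane at coordinates uv in [0, 1]^2, shape (N, 2)."""
    res = plane.shape[0]
    p = np.clip(uv, 0.0, 1.0) * (res - 1)
    i0 = np.clip(np.floor(p).astype(int), 0, res - 2)  # lower cell corner
    w = p - i0                                         # fractional offsets
    a, b = i0[:, 0], i0[:, 1]
    wa, wb = w[:, :1], w[:, 1:]
    return (
        plane[a, b] * (1 - wa) * (1 - wb)
        + plane[a + 1, b] * wa * (1 - wb)
        + plane[a, b + 1] * (1 - wa) * wb
        + plane[a + 1, b + 1] * wa * wb
    )


def query_hexplane(levels, xyzt):
    """Return features of shape (N, num_levels * feat_dim) for points in [0, 1]^4."""
    per_level = []
    for planes in levels:
        feat = 1.0
        for plane, (i, j) in zip(planes, PLANE_AXES):
            # Assumed fusion rule: elementwise product over the six planes.
            feat = feat * bilinear(plane, xyzt[:, [i, j]])
        per_level.append(feat)
    # Assumed multi-resolution rule: concatenate features across levels.
    return np.concatenate(per_level, axis=1)


levels = make_hexplane(resolutions=[4, 8], feat_dim=3)
points = np.random.default_rng(1).uniform(size=(5, 4))  # normalized (x, y, z, t)
features = query_hexplane(levels, points)               # shape (5, 6)
```

In a pipeline like the one described above, features of this kind would typically be passed to a small decoder that predicts a per-Gaussian deformation at time t; that decoder is omitted here.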
Key results
- State-of-the-art PSNR and SSIM scores on Waymo-NOTR and Waymo-Street datasets
- Fully self-supervised decomposition of static and dynamic scene elements
- Superior rendering quality and temporal consistency for fast-moving and distant objects
- Efficient training requiring only ~10GB GPU memory while maintaining real-time rendering
Why it matters
Enables scalable, annotation-free 3D street scene reconstruction for high-fidelity, real-world simulators crucial for advancing end-to-end autonomous driving systems.
Abstract
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian (S3Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our S3Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations.
1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University. 2 UC Berkeley. 3 Tsinghua University. ∗ Equal contribution. † Project leader. Corresponding author: shanghang@pku.edu.cn