UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene Via Rendering Fusion
Ye Wu, Ruiqi Song, Baiyong Ding, Nanxing Zeng, JunJie Cheng, Yunfeng Ai
AI summary
Problem
Traditional 3D perception methods struggle in unstructured scenes due to sparse layouts, irregular obstacles, and severe long-tail class distributions that hinder cross-modal fusion and prediction accuracy.
Approach
The authors propose UnsOcc, which aligns image and LiDAR features through bidirectional rendering supervision and uses 3D Gaussian Splatting to project sparse 3D predictions into dense 2D maps for auxiliary supervision.
Key results
- Introduction of RenderFusion for cross-modal feature alignment via bidirectional rendering supervision
- Development of GSRefinement for detail-aware auxiliary supervision using 3D Gaussian Splatting
- Creation of a dedicated open-pit mine dataset for unstructured scene perception
- State-of-the-art performance on both the new dataset and the nuScenes benchmark
Why it matters
Enables robust 3D scene understanding for autonomous vehicles operating in complex, unstructured environments like mining sites and construction zones.
Abstract
Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.
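The idea of turning 3D occupancy predictions into a 2D semantic map can be illustrated with a much simpler stand-in than Gaussian Splatting. The sketch below is not the authors' GSRefinement: it projects voxel centers through an assumed pinhole camera and keeps the nearest label per pixel with a z-buffer, whereas splatting would blend soft Gaussian footprints into a dense image. The function name, the `EMPTY` label, and the camera parameters are all hypothetical and chosen only for illustration.

```python
import math

EMPTY = -1  # hypothetical label for pixels that no voxel projects onto


def project_voxels_to_semantic_map(voxels, focal, width, height):
    """Project occupied voxel centers into a 2D label map with a z-buffer.

    voxels: iterable of (x, y, z, label) in camera coordinates, z pointing forward.
    This is a hard nearest-point projection, not Gaussian Splatting.
    """
    labels = [[EMPTY] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # assumed principal point at image center

    for x, y, z, label in voxels:
        if z <= 0:
            continue  # voxel is behind the camera
        # Pinhole projection of the voxel center to pixel coordinates.
        u = int(math.floor(focal * x / z + cx))
        v = int(math.floor(focal * y / z + cy))
        # Keep only the closest voxel that lands on each pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            labels[v][u] = label
    return labels


# Two voxels on the same ray (the nearer one wins), one offset voxel,
# and one voxel behind the camera that is ignored.
demo_voxels = [(0, 0, 2, 1), (0, 0, 1, 2), (1, 0, 1, 3), (0, 0, -1, 4)]
demo_map = project_voxels_to_semantic_map(demo_voxels, focal=1.0, width=4, height=4)
```

With a hard projection like this, most pixels stay `EMPTY` when the scene is sparse, which is one way to see why a rendering method that produces dense 2D maps is attractive for supervising rare classes.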