ICRA 2026

UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene Via Rendering Fusion

Ye Wu, Ruiqi Song, Baiyong Ding, Nanxing Zeng, JunJie Cheng, Yunfeng Ai

AI summary

A novel multi-modal framework leveraging bidirectional rendering supervision and 3D Gaussian Splatting significantly improves 3D semantic occupancy prediction in sparse, unstructured environments.

3D semantic occupancy, unstructured scenes, multi-modal fusion, 3D Gaussian splatting, rendering supervision, autonomous driving

Problem

Traditional 3D perception methods struggle in unstructured scenes due to sparse layouts, irregular obstacles, and severe long-tail class distributions that hinder cross-modal fusion and prediction accuracy.

Approach

The authors propose UnsOcc, which aligns image and LiDAR features through bidirectional rendering supervision and uses 3D Gaussian Splatting to project sparse 3D predictions into dense 2D maps for auxiliary supervision.
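To make the idea of turning 3D occupancy predictions into a 2D semantic map concrete, here is a minimal sketch. It is not the authors' GSRefinement module and does not use Gaussian Splatting: it substitutes a hard pinhole projection with a z-buffer, which is the simplest way to see why a naive projection of sparse voxels yields a sparse image. The function name, the camera-coordinate convention (x right, y down, z forward), and the toy intrinsics are all assumptions made for illustration.

```python
import math


def project_voxels_to_semantic_map(voxels, fx, fy, cx, cy, width, height, unlabeled=-1):
    """Project labeled voxel centers into a 2D semantic map using a z-buffer.

    voxels: iterable of (x, y, z, label) tuples in camera coordinates
            (x right, y down, z forward); this convention is an assumption.
    Returns a height x width list of lists holding one label per pixel.
    """
    sem = [[unlabeled] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z, label in voxels:
        if z <= 0:  # voxel lies behind the camera, so it cannot be seen
            continue
        # Pinhole projection of the voxel center to integer pixel coordinates.
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        # Keep only the nearest voxel that lands on each pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            sem[v][u] = label
    return sem


# Toy scene (hypothetical values): two voxels on the same ray, one off-axis,
# and one behind the camera.
voxels = [
    (0.0, 0.0, 5.0, 2),   # far voxel on the optical axis
    (0.0, 0.0, 2.0, 7),   # nearer voxel on the same ray, occludes the first
    (2.0, 0.0, 2.0, 3),   # off-axis voxel, lands one pixel to the right
    (0.0, 0.0, -1.0, 9),  # behind the camera, ignored
]
sem_map = project_voxels_to_semantic_map(
    voxels, fx=1.0, fy=1.0, cx=2.0, cy=2.0, width=4, height=4
)
```

In this toy scene only two of the sixteen pixels receive a label. A splatting-based renderer instead spreads each 3D primitive over a soft 2D footprint and blends overlapping footprints, which is how a sparse set of 3D predictions can produce a dense 2D map suitable for comparison against image-space segmentation labels.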

Key results

Introduction of RenderFusion for cross-modal feature alignment via bidirectional rendering supervision
Development of GSRefinement for detail-aware auxiliary supervision using 3D Gaussian Splatting
Creation of a dedicated open-pit mine dataset for unstructured scene perception
State-of-the-art performance on both the new dataset and the nuScenes benchmark

Why it matters

Enables robust 3D scene understanding for autonomous vehicles operating in complex, unstructured environments like mining sites and construction zones.

Abstract

Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

Index terms

Mining Robotics, Field Robots