ICRA 2026

TACOcc: Target-Adaptive Cross-Modal Fusion with Sequential Volume Rendering for 3D Semantic Occupancy Prediction

Luyao Lei, Shuo Xu, Yifan Bai, Zelin Yang, Yuanbo Guo, Xing Wei

AI summary

TACOcc achieves state-of-the-art 3D semantic occupancy prediction by dynamically aligning cross-modal features and stabilizing predictions in dynamic scenes through sequential volume rendering.

3D semantic occupancy, cross-modal fusion, sequential volume rendering, adaptive neighborhood, autonomous driving, 3D Gaussian splatting

Problem

Multi-modal 3D semantic occupancy prediction suffers from two issues: geometric-semantic misalignment caused by fixed-neighborhood fusion, and feature degradation with prediction inconsistency in dynamic scenes caused by sparse supervision.

Approach

The framework predicts a neighborhood size for each query via a differentiable Gumbel-Softmax strategy to align image and LiDAR features, then stabilizes predictions under sparse labels and motion with a sequential volume renderer that transfers multi-view photometric constraints from images to 3D occupancy features.
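The query-wise neighborhood selection can be illustrated with a minimal Gumbel-Softmax sketch. This is not the authors' implementation: the candidate sizes, the logits, the temperature `tau`, and the function names are all hypothetical, and in the actual model the logits would presumably come from a learned head rather than being hard-coded.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return relaxed one-hot weights over the candidates (they sum to 1)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_neighborhood(logits, candidate_sizes, tau=1.0, rng=random):
    """Pick a neighborhood size for one query (hypothetical helper)."""
    weights = gumbel_softmax(logits, tau, rng)
    # Soft size: differentiable expectation over the candidates.
    soft_k = sum(w * k for w, k in zip(weights, candidate_sizes))
    # Hard size: the candidate with the largest relaxed weight.
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, soft_k, candidate_sizes[best]


# Assumed candidate neighborhood sizes and per-query logits (illustrative only).
CANDIDATE_SIZES = [1, 3, 5, 7]
rng = random.Random(0)
weights, soft_k, hard_k = select_neighborhood(
    [0.1, 0.2, 2.5, 0.3], CANDIDATE_SIZES, tau=0.5, rng=rng
)
print(weights, soft_k, hard_k)
```

Lower values of `tau` push the weights toward a one-hot choice, while higher values keep them closer to uniform. In a trained model the relaxed weights let gradients reach the size-prediction logits, which a plain argmax would block.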

Key results

28.9% mIoU on nuScenes, surpassing prior SOTA by 2.3%
Significant accuracy gains for small objects and long-range perception
Velocity-adaptive temporal bandwidth suppresses motion-induced flickering
Strong generalization and performance improvements on SemanticKITTI
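For readers unfamiliar with the headline metric, the sketch below computes mean intersection-over-union (mIoU) over flattened voxel labels. It shows the standard definition only; the benchmark's official evaluation code, class list, and ignore conventions are not given on this page, so `ignore_label` and the toy labels are assumptions.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip voxels without a valid ground-truth label
        if p == t:
            inter[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoU is 1/3, 2/3, 1/2, so the mean is 0.5
```

Because every class contributes equally to the mean, rare and small-object classes weigh as much as large ones.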

Why it matters

Supports more robust and accurate environmental perception for autonomous driving by addressing two bottlenecks in multi-modal occupancy prediction: cross-modal alignment and temporal consistency.

Abstract

Multi-modal 3D semantic occupancy prediction remains challenged by two fundamental issues: (i) geometric–semantic misalignment introduced by fixed-neighborhood fusion under heterogeneous sensing distributions, and (ii) feature degradation with prediction inconsistency in dynamic scenes caused by sparse supervision. We propose TACOcc, a framework coupling a target-adaptive, bidirectional symmetric fusion module with sequential volume rendering supervision. The fusion module predicts a query-wise neighborhood size via a differentiable Gumbel-Softmax strategy, expanding the receptive field for large objects to enrich context while contracting it for small objects to suppress noise, thereby achieving precise cross-modal alignment. To stabilize predictions under sparse labels and motion, we introduce temporally enhanced Gaussian rendering that aggregates multi-frame dependencies, initializes dual-source geometric anchors, and transfers multi-view photometric constraints from images to 3D occupancy features. A velocity-adaptive temporal bandwidth further mitigates flicker in fast-motion cases. Experiments on nuScenes and SemanticKITTI demonstrate strong performance, including 28.9% mIoU on nuScenes, particularly improving small-object categories and long-range regions. These results highlight that scale-aware bidirectional fusion and temporally grounded volumetric supervision form an effective recipe for robust multi-modal occupancy perception.
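Rendering-based supervision of this kind rests on front-to-back alpha compositing, which turns per-sample 3D quantities into a per-pixel 2D value that can be compared with image-space targets. The sketch below shows only that standard compositing rule for a single ray. It is a simplification under stated assumptions: the paper's renderer is Gaussian-based and multi-frame, and the function name, opacities, and class probabilities here are hypothetical.

```python
def composite_along_ray(alphas, class_probs):
    """Front-to-back compositing of per-sample class probabilities on one ray.

    alphas: opacity in [0, 1] for each sample, ordered near to far.
    class_probs: one list of class probabilities per sample.
    Returns the rendered class vector and the remaining transmittance.
    """
    num_classes = len(class_probs[0])
    rendered = [0.0] * num_classes
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for alpha, probs in zip(alphas, class_probs):
        weight = transmittance * alpha  # contribution of this sample
        for c in range(num_classes):
            rendered[c] += weight * probs[c]
        transmittance *= 1.0 - alpha  # light remaining after this sample
    return rendered, transmittance


# Three samples on one ray: empty space, a half-opaque sample, an opaque one.
alphas = [0.0, 0.5, 1.0]
class_probs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rendered, remaining = composite_along_ray(alphas, class_probs)
print(rendered, remaining)  # [0.5, 0.5] 0.0
```

Since each rendered value is a weighted sum of the per-sample inputs, a loss defined on the 2D rendering propagates back to every 3D sample that contributed to the ray.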

Index terms

Object Detection, Segmentation and Categorization; Semantic Scene Understanding; Deep Learning Methods