TACOcc: Target-Adaptive Cross-Modal Fusion with Sequential Volume Rendering for 3D Semantic Occupancy Prediction
Luyao Lei, Shuo Xu, Yifan Bai, Zelin Yang, Yuanbo Guo, Xing Wei
AI summary
Problem
Multi-modal 3D semantic occupancy prediction suffers from two problems: geometric-semantic misalignment caused by fixed-neighborhood fusion, and feature degradation with inconsistent predictions in dynamic scenes caused by sparse supervision.
Approach
The framework dynamically predicts optimal neighborhood sizes per query via Gumbel-Softmax to align image and LiDAR features, then stabilizes temporal predictions using a sequential volume renderer that transfers 2D photometric constraints to 3D space.
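To make the neighborhood-size selection concrete, here is a minimal NumPy sketch of Gumbel-Softmax sampling over a set of candidate sizes. It shows only the forward pass. The candidate set `CANDIDATE_K`, the function names, and the temperature value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Hypothetical candidate neighborhood sizes; the paper's actual set is not given here.
CANDIDATE_K = np.array([4, 8, 16, 32])


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed one-hot sample over the last axis (forward pass only)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise by inverse transform; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


def select_neighborhood_size(logits, tau=1.0, rng=None):
    """Return relaxed probabilities plus hard and soft size choices per query."""
    probs = gumbel_softmax(logits, tau, rng)
    hard_k = CANDIDATE_K[probs.argmax(axis=-1)]  # discrete choice used at inference
    soft_k = probs @ CANDIDATE_K                 # expected size under the relaxation
    return probs, hard_k, soft_k


# Two queries: one strongly prefers the largest size, one the smallest.
logits = np.array([[0.0, 0.0, 0.0, 50.0],
                   [50.0, 0.0, 0.0, 0.0]])
probs, hard_k, soft_k = select_neighborhood_size(
    logits, tau=0.5, rng=np.random.default_rng(0)
)
```

In a trainable version, the relaxed probabilities would carry gradients back to the network that produces the logits (for example through a straight-through estimator), and lowering `tau` pushes the samples closer to one-hot.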
Key results
- 28.9% mIoU on nuScenes, surpassing prior SOTA by 2.3%
- Significant accuracy gains for small objects and long-range perception
- Velocity-adaptive temporal bandwidth suppresses motion-induced flickering
- Strong generalization and performance improvements on SemanticKITTI
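For readers unfamiliar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth voxel labels. The sketch below follows the standard definition. The function name and the handling of ignored labels are assumptions, and benchmark protocols may differ (for example, by accumulating intersections and unions over the whole dataset before dividing).

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean of per-class IoU over classes that appear in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        keep = target != ignore_index  # drop unlabeled voxels
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so mIoU = 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```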
Why it matters
Enables more robust and accurate real-time environmental perception for autonomous driving by resolving critical cross-modal alignment and temporal consistency bottlenecks.
Abstract
Multi-modal 3D semantic occupancy prediction remains challenged by two fundamental issues: (i) geometric-semantic misalignment introduced by fixed-neighborhood fusion under heterogeneous sensing distributions, and (ii) feature degradation with prediction inconsistency in dynamic scenes caused by sparse supervision. We propose TACOcc, a framework coupling a target-adaptive, bidirectional symmetric fusion module with sequential volume rendering supervision. The fusion module predicts a query-wise neighborhood size via a differentiable Gumbel-Softmax strategy, expanding the receptive field for large objects to enrich context while contracting it for small objects to suppress noise, thereby achieving precise cross-modal alignment. To stabilize predictions under sparse labels and motion, we introduce temporally enhanced Gaussian rendering that aggregates multi-frame dependencies, initializes dual-source geometric anchors, and transfers multi-view photometric constraints from images to 3D occupancy features. A velocity-adaptive temporal bandwidth further mitigates flicker in fast-motion cases. Experiments on nuScenes and SemanticKITTI demonstrate strong performance, including 28.9% mIoU on nuScenes, particularly improving small-object categories and long-range regions. These results highlight that scale-aware bidirectional fusion and temporally grounded volumetric supervision form an effective recipe for robust multi-modal occupancy perception.
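The abstract does not give a formula for the velocity-adaptive temporal bandwidth, so the following is only one plausible reading, sketched under stated assumptions: frames are weighted by a Gaussian kernel over their temporal offset, and the kernel width shrinks as speed increases so that stale frames contribute less under fast motion. The function name, the form `sigma_max / (1 + alpha * speed)`, and the default values are all hypothetical.

```python
import numpy as np


def temporal_weights(frame_offsets, speed, sigma_max=2.0, alpha=0.5):
    """Normalized per-frame weights from a Gaussian kernel over frame offsets.

    Assumed (not from the paper): the bandwidth sigma narrows as speed grows.
    """
    offsets = np.asarray(frame_offsets, dtype=float)
    sigma = sigma_max / (1.0 + alpha * max(speed, 0.0))
    w = np.exp(-0.5 * (offsets / sigma) ** 2)
    return w / w.sum()


offsets = [0, -1, -2, -3]                      # current frame and three past frames
slow = temporal_weights(offsets, speed=0.0)    # wide window, history is trusted
fast = temporal_weights(offsets, speed=10.0)   # narrow window, mostly current frame
```

Under this sketch, a fast-moving target concentrates nearly all weight on the current frame, which is the intuition behind suppressing motion-induced flicker; the actual mechanism in TACOcc may differ.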