ICRA 2026

A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X-Enabled Autonomous Driving

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada

AI summary

Collaborative perception improves 3D semantic occupancy prediction accuracy over single-vehicle perception, with gains that grow as the prediction range expands.

Collaborative perception, 3D semantic occupancy, V2X, synthetic benchmark, CARLA, voxel prediction

Problem

Single-vehicle 3D semantic occupancy prediction is constrained by occlusions and limited sensor range, while existing datasets lack dense, voxel-level annotations for multi-agent V2X scenarios.

Approach

The authors introduce a high-resolution synthetic dataset in CARLA with dense voxel annotations and propose a baseline model that fuses multi-agent features via spatial alignment and confidence-guided attention.
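The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes each agent's voxel features have already been spatially aligned to the ego frame, and it stands in for confidence-guided attention with a plain softmax over per-voxel confidence scores. The function name `fuse_by_confidence`, the tensor layout, and the confidence input are all hypothetical.

```python
import numpy as np

def fuse_by_confidence(features, confidence_logits):
    """Fuse per-agent voxel features with softmax weights over the agent axis.

    features:          (A, C, X, Y, Z) features, assumed already aligned to the ego frame
    confidence_logits: (A, X, Y, Z) hypothetical per-voxel confidence score per agent
    """
    # Numerically stable softmax across agents at every voxel.
    shifted = confidence_logits - confidence_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Broadcast the weights over the channel axis, then sum over agents.
    return (weights[:, None] * features).sum(axis=0)
```

With equal confidence the result is the per-voxel mean of the agents' features; as one agent's confidence dominates, the fused feature approaches that agent's feature.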

Key results

Co3SOP dataset with high-resolution, dense 3D semantic voxel annotations
Collaborative baseline model using spatial alignment and confidence-guided attention
Multi-range benchmarks showing consistent performance gains over single-agent methods
Prediction accuracy gains that grow with the prediction range, evaluated under pose noise
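To make the multi-range evaluation concrete, here is a small sketch of scoring a predicted semantic voxel grid at several ranges. It assumes mean IoU as the metric (common for semantic occupancy, but not confirmed by this page), an ego-centered grid of integer class labels, and ranges expressed as a half-extent in voxels. The helper names `mean_iou` and `crop_to_range` are hypothetical.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both grids, so it is skipped
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def crop_to_range(grid, half_extent_voxels):
    """Keep the central square region of an ego-centered (X, Y, Z) grid.

    half_extent_voxels must not exceed half of the X or Y dimension.
    """
    cx, cy = grid.shape[0] // 2, grid.shape[1] // 2
    h = half_extent_voxels
    return grid[cx - h:cx + h, cy - h:cy + h, :]
```

Calling `mean_iou(crop_to_range(pred, h), crop_to_range(gt, h), num_classes)` for increasing `h` gives one score per range, which is the kind of comparison a multi-range benchmark reports for single-agent and collaborative models.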

Why it matters

Provides a critical foundation for advancing fine-grained, multi-agent scene understanding in autonomous driving research and simulation.

Abstract

3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, despite its fine-grained scene understanding, its effectiveness is inherently constrained in single-vehicle setups by occlusions, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy of predictions. Despite its potential, research on collaborative 3D semantic occupancy prediction is hindered by the lack of dedicated datasets. To bridge this gap, we design a high-resolution semantic voxel sensor in CARLA to produce dense and comprehensive annotations for V2X scenarios. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. Experimental results demonstrate the superior performance of our baseline enabled by vehicle collaboration, with increasing gains observed as the prediction range expands. Our code and data are available at https://github.com/tlab-wide/Co3SOP.

Index terms

Performance Evaluation and Benchmarking, Semantic Scene Understanding, Computer Vision for Automation