Vision-Based Panoptic Occupancy Prediction in Urban Environments
Rodrigo Marcuzzi, Lucas Nunes, Elias Ariel Marks, Xingguang Zhong, Jens Behley, Cyrill Stachniss
AI summary
Problem
Existing 3D panoptic occupancy methods rely on expensive LiDAR scans or manual 3D voxel labels, while RGB-only approaches typically lack instance-level details and fail to model dynamic objects.
Approach
The pipeline generates panoptic occupancy pseudo-labels from multi-view RGB images alone. Bundle adjustment over the training images provides depth, a pre-trained open-vocabulary image model provides semantic and instance segmentation, and a 3D foundation model provides per-image depth so that dynamic objects can be added to the pseudo-labels. The network is then trained directly on these pseudo-labels.
Key results
- First method for 3D panoptic occupancy prediction without any LiDAR supervision
- State-of-the-art semantic occupancy performance among label-free methods
- Explicit prediction of dynamic objects using only RGB data
- Voxel-level occupancy, semantic, and instance ID outputs across the full 3D grid
Why it matters
Eliminates the need for costly LiDAR sensors and manual 3D annotations, enabling scalable and cost-effective scene understanding for autonomous driving systems.
Abstract
Understanding the surrounding scene geometrically and semantically is a key requirement for autonomously navigating systems. Vision-based 3D panoptic occupancy prediction aims to provide a 3D representation of the surroundings including semantic meaning and identifying individual objects such as traffic participants in the context of urban navigation. The majority of vision-based approaches to occupancy prediction require 3D voxel labels or segmented LiDAR scans as supervision signal. While other vision-based approaches use only a few consecutive images for supervision, these approaches typically do not provide instance-level information, which is crucial for achieving a holistic understanding of the scene. In this paper, we propose a novel method for 3D panoptic occupancy prediction that relies solely on image data for both training and inference. We use bundle adjustment to align all available images in the training set to obtain depth information. We further use a pre-trained open-vocabulary image model to obtain panoptic segmentation of the RGB images and generate occupancy pseudo labels to directly optimize for the 3D panoptic occupancy prediction task. Furthermore, we use a 3D foundation model to obtain depth predictions for individual images to add dynamic objects into the pseudo labels. Without any manual or LiDAR-based annotations, our approach outputs occupancy, semantic class, and instance ID for each 3D voxel in the full voxel grid. We achieve state-of-the-art results on 3D semantic occupancy prediction among label-free methods, and we propose the first method for 3D panoptic occupancy without any LiDAR supervision.
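To make the pseudo-label idea concrete, the following is a minimal sketch of how one image with per-pixel depth and panoptic labels could be lifted into a voxel grid that stores occupancy, semantic class, and instance ID. It is an illustration under stated assumptions, not the authors' implementation: the function name `backproject_to_voxels`, the pinhole intrinsics `K`, the pose `cam_to_world`, and the grid parameters are all hypothetical, and when several pixels fall into the same voxel the sketch simply keeps one of them rather than voting or fusing across views.

```python
import numpy as np

def backproject_to_voxels(depth, semantics, instances, K, cam_to_world,
                          grid_min, voxel_size, grid_shape):
    """Lift one labeled depth image into occupancy / semantic / instance grids.

    Hypothetical helper for illustration only. Pixels with depth <= 0 are
    treated as invalid; unobserved voxels keep the label -1.
    """
    h, w = depth.shape
    # Pixel coordinates: v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0

    # Pinhole back-projection into the camera frame.
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)

    # Transform to the world frame with a 4x4 homogeneous pose.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]

    # Convert metric coordinates to integer voxel indices.
    idx = np.floor((pts_world - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]

    occ = np.zeros(grid_shape, dtype=bool)
    sem = np.full(grid_shape, -1, dtype=np.int32)
    inst = np.full(grid_shape, -1, dtype=np.int32)

    # Simplification: if several points hit one voxel, a single label is kept.
    i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]
    occ[i, j, k] = True
    sem[i, j, k] = semantics[valid][inside]
    inst[i, j, k] = instances[valid][inside]
    return occ, sem, inst
```

In a multi-view setting, a sketch like this would be run per image and the per-voxel labels aggregated, for example by majority vote; that aggregation step is an assumption here and is not described in the abstract.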