ICRA 2026

Vision-Based Panoptic Occupancy Prediction in Urban Environments

Rodrigo Marcuzzi, Lucas Nunes, Elias Ariel Marks, Xingguang Zhong, Jens Behley, Cyrill Stachniss

AI summary

We introduce the first LiDAR-free method for 3D panoptic occupancy prediction. It generates explicit supervision from RGB images alone, achieves state-of-the-art 3D semantic occupancy results among label-free methods, and captures dynamic objects.

3D occupancy prediction, panoptic segmentation, RGB-only supervision, pseudo-label generation, autonomous navigation, foundation models

Problem

Existing 3D panoptic occupancy methods rely on expensive LiDAR scans or manual 3D voxel labels, while RGB-only approaches typically lack instance-level details and fail to model dynamic objects.

Approach

The pipeline generates panoptic occupancy pseudo-labels from RGB images in three steps: bundle adjustment over all training images provides depth, a pre-trained open-vocabulary image model provides panoptic (semantic and instance) segmentation, and a 3D foundation model provides per-image depth predictions that are used to add dynamic objects. The network is then trained directly on these pseudo-labels.
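As an illustration of how depth and panoptic masks could be turned into voxel pseudo-labels, here is a minimal sketch. It is not the authors' implementation: the pinhole camera model, the per-voxel majority vote, and all function and parameter names (`backproject`, `to_world`, `build_pseudo_labels`, `intrinsics`, `origin`, `voxel_size`, `grid_shape`) are assumptions made for this example.

```python
from collections import Counter, defaultdict


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def to_world(p_cam, R, t):
    """Apply a camera-to-world pose: rotation R (3x3 nested lists) and translation t."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


def build_pseudo_labels(depth, semantic, instance, intrinsics, R, t,
                        origin, voxel_size, grid_shape):
    """Return {voxel index: (semantic class, instance ID)} for one image.

    depth, semantic and instance are row-major 2D lists of equal size.
    Pixels with non-positive depth are treated as invalid (an assumption).
    """
    fx, fy, cx, cy = intrinsics
    votes = defaultdict(Counter)
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:  # no valid depth for this pixel
                continue
            p = to_world(backproject(u, v, d, fx, fy, cx, cy), R, t)
            idx = tuple(int((p[i] - origin[i]) // voxel_size) for i in range(3))
            if all(0 <= idx[i] < grid_shape[i] for i in range(3)):
                votes[idx][(semantic[v][u], instance[v][u])] += 1
    # Majority vote per voxel (an assumed rule, not taken from the paper).
    return {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}
```

Votes from several images could be accumulated in the same `votes` structure before the final majority vote, so that a voxel seen from multiple views receives its most frequently observed label.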

Key results

First LiDAR-free method for 3D panoptic occupancy prediction
State-of-the-art semantic occupancy performance among label-free methods
Explicit prediction of dynamic objects using only RGB data
Voxel-level occupancy, semantic, and instance ID outputs across the full 3D grid
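To make the per-voxel output concrete, the following sketch shows one possible container for occupancy, semantic class, and instance ID over a dense grid. The class `PanopticVoxelGrid`, its flat row-major layout, and the convention that instance ID 0 means "no instance" are assumptions for illustration, not the paper's data format.

```python
from collections import Counter


class PanopticVoxelGrid:
    """Dense grid storing occupancy, semantic class and instance ID per voxel."""

    def __init__(self, shape, free_label=0):
        self.shape = shape
        n = shape[0] * shape[1] * shape[2]
        self.occupied = [False] * n
        self.semantic = [free_label] * n
        self.instance = [0] * n  # 0 = no instance (assumed convention)

    def _flat(self, x, y, z):
        """Convert a 3D index into a flat row-major index, with bounds checking."""
        X, Y, Z = self.shape
        if not (0 <= x < X and 0 <= y < Y and 0 <= z < Z):
            raise IndexError("voxel index out of range")
        return (x * Y + y) * Z + z

    def set(self, x, y, z, semantic, instance=0):
        """Mark a voxel as occupied with the given labels."""
        i = self._flat(x, y, z)
        self.occupied[i] = True
        self.semantic[i] = semantic
        self.instance[i] = instance

    def get(self, x, y, z):
        """Return (occupied, semantic class, instance ID) for a voxel."""
        i = self._flat(x, y, z)
        return self.occupied[i], self.semantic[i], self.instance[i]

    def instance_sizes(self):
        """Count occupied voxels per (semantic class, instance ID), skipping instance 0."""
        sizes = Counter()
        for occ, sem, inst in zip(self.occupied, self.semantic, self.instance):
            if occ and inst > 0:
                sizes[(sem, inst)] += 1
        return dict(sizes)
```

Keeping the three quantities in parallel arrays means every voxel in the grid has a defined answer, including free voxels, which matches the idea of predicting over the full 3D grid rather than only at observed points.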

Why it matters

Removes the need for LiDAR scans and manual 3D annotations as supervision, which makes training data cheaper to collect and easier to scale for scene understanding in autonomous driving.

Abstract

Understanding the surrounding scene geometrically and semantically is a key requirement for autonomously navigating systems. Vision-based 3D panoptic occupancy prediction aims to provide a 3D representation of the surroundings including semantic meaning and identifying individual objects such as traffic participants in the context of urban navigation. The majority of vision-based approaches to occupancy prediction require 3D voxel labels or segmented LiDAR scans as supervision signal. While other vision-based approaches use only a few consecutive images for supervision, these approaches typically do not provide instance-level information, which is crucial for achieving a holistic understanding of the scene. In this paper, we propose a novel method for 3D panoptic occupancy prediction that relies solely on image data for both training and inference. We use bundle adjustment to align all available images in the training set to obtain depth information. We further use a pre-trained open-vocabulary image model to obtain panoptic segmentation of the RGB images and generate occupancy pseudo labels to directly optimize for the 3D panoptic occupancy prediction task. Furthermore, we use a 3D foundation model to obtain depth predictions for individual images to add dynamic objects into the pseudo labels. Without any manual or LiDAR-based annotations, our approach outputs occupancy, semantic class, and instance ID for each 3D voxel in the full voxel grid. We achieve state-of-the-art results on 3D semantic occupancy prediction among label-free methods, and we propose the first method for 3D panoptic occupancy without any LiDAR supervision.
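The abstract describes two depth sources: bundle adjustment over the training images, and per-image predictions from a 3D foundation model used for dynamic objects. How the two are combined is not stated here. One plausible way, shown purely as a hedged sketch, is to rescale the predicted depth to the bundle-adjustment depth using static pixels and then use the rescaled prediction only where a pixel belongs to a dynamic object. The median-ratio alignment and the names `align_scale` and `fuse_depth` are assumptions, not the authors' method.

```python
from statistics import median


def align_scale(pred_depth, ref_depth, static_mask):
    """Estimate a global scale mapping predicted depth onto reference depth.

    Uses the median ratio over static pixels where both depths are valid (> 0).
    """
    ratios = [r / p
              for p, r, is_static in zip(pred_depth, ref_depth, static_mask)
              if is_static and p > 0 and r > 0]
    if not ratios:
        raise ValueError("no valid static pixels to align on")
    return median(ratios)


def fuse_depth(pred_depth, ref_depth, dynamic_mask):
    """Per-pixel depth: reference depth on static pixels, rescaled prediction on dynamic ones.

    All inputs are flat lists of equal length (one entry per pixel).
    """
    static_mask = [not is_dynamic for is_dynamic in dynamic_mask]
    scale = align_scale(pred_depth, ref_depth, static_mask)
    return [scale * p if is_dynamic else r
            for p, r, is_dynamic in zip(pred_depth, ref_depth, dynamic_mask)]
```

The median is used here because it tolerates a moderate number of outlier pixels; a least-squares scale-and-shift fit would be another reasonable choice under the same assumptions.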

Index terms

Semantic Scene Understanding, Deep Learning Methods