ICRA 2026

ForecastOcc: Vision-Based Semantic Occupancy Forecasting

Riya Mohan, Hurtado Juana Valeria, Rohit Mohan, Abhinav Valada

AI summary

ForecastOcc directly predicts future 3D semantic occupancy from camera images, outperforming baselines by jointly capturing spatial, temporal, and semantic scene dynamics.

Semantic occupancy forecasting, Vision-based prediction, Autonomous driving, Temporal cross-attention, 3D scene understanding, Multi-view learning

Problem

Existing vision-based occupancy forecasting methods lack semantic detail or rely on error-prone, separate occupancy estimation networks, preventing direct learning of spatio-temporal features from raw images.

Approach

ForecastOcc directly processes past multi-view camera images through a novel temporal cross-attention forecasting module and a 2D-to-3D view transformer to jointly predict voxel-level future occupancy and semantic categories across multiple time horizons.
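As a rough illustration of the temporal cross-attention idea named above, the following NumPy sketch lets one query per future horizon attend over the features of past frames at each spatial location. It is a toy under stated assumptions, not the ForecastOcc module: the function name `temporal_cross_attention`, the tensor shapes, and the omission of learned query/key/value projections are all choices made here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_cross_attention(horizon_queries, past_features):
    """Attend over past frames separately at every spatial token.

    horizon_queries: (H, N, D) array, one query per future horizon and token
                     (hypothetical layout, not taken from the paper).
    past_features:   (T, N, D) array, image features of T past frames.
    Returns (forecast, weights) with shapes (H, N, D) and (H, N, T).
    """
    d = horizon_queries.shape[-1]
    # scores[h, n, t] = <query[h, n], past[t, n]> / sqrt(D)
    scores = np.einsum("hnd,tnd->hnt", horizon_queries, past_features) / np.sqrt(d)
    # Normalize over the time axis so each query distributes weight over past frames.
    weights = softmax(scores, axis=-1)
    # Each forecast feature is a weighted sum of past features at the same token.
    forecast = np.einsum("hnt,tnd->hnd", weights, past_features)
    return forecast, weights


rng = np.random.default_rng(0)
past = rng.normal(size=(3, 4, 8))     # T=3 past frames, N=4 tokens, D=8 channels
queries = rng.normal(size=(2, 4, 8))  # H=2 future horizons
forecast, weights = temporal_cross_attention(queries, past)
print(forecast.shape, weights.shape)  # (2, 4, 8) (2, 4, 3)
```

In a full model, the resulting per-horizon features would then be lifted to 3D by the view transformer. That step is not sketched here because the summary gives no detail on it.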

Key results

First vision-based framework for direct semantic occupancy forecasting
Novel temporal cross-attention and 2D-to-3D transformer architecture
Establishes new multi-view and monocular forecasting benchmarks
Consistently outperforms adapted 2D baselines across multiple horizons

Why it matters

Enables robust, semantically rich future scene understanding for autonomous driving and robotics without relying on costly LiDAR or error-prone two-stage pipelines.

Abstract

Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving. We make the code publicly available at https://forecastocc.cs.uni-freiburg.de
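The abstract does not name the evaluation metric. Assuming a per-horizon mean intersection-over-union (mIoU) over voxel labels, a common choice for semantic occupancy, a comparison across horizons could be sketched as follows. The function name `miou_per_horizon`, the `(H, X, Y, Z)` label layout, and the `ignore_index` value of 255 are hypothetical, and the tiny arrays are toy data rather than results from the paper.

```python
import numpy as np


def miou_per_horizon(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes, computed separately for each forecast horizon.

    pred, target: integer label arrays of shape (H, X, Y, Z).
    Voxels whose target equals ignore_index are excluded (assumed convention).
    Classes absent from both prediction and target at a horizon are skipped.
    """
    scores = []
    for p, t in zip(pred, target):
        valid = t != ignore_index
        ious = []
        for c in range(num_classes):
            inter = np.sum((p == c) & (t == c) & valid)
            union = np.sum(((p == c) | (t == c)) & valid)
            if union > 0:
                ious.append(inter / union)
        scores.append(float(np.mean(ious)) if ious else float("nan"))
    return scores


# Toy example: 2 horizons on a 2x2x1 voxel grid with 2 classes.
target = np.array([0, 0, 1, 1, 0, 0, 1, 1]).reshape(2, 2, 2, 1)
pred = np.array([0, 0, 1, 1, 0, 1, 1, 1]).reshape(2, 2, 2, 1)
scores = miou_per_horizon(pred, target, num_classes=2)
print(scores)  # horizon 0 is a perfect match; horizon 1 has one wrong voxel
```

Reporting one score per horizon, rather than a single average, shows how forecast quality changes as the prediction horizon grows.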

Index terms

Deep Learning for Visual Perception, Semantic Scene Understanding, Computer Vision for Transportation