ForecastOcc: Vision-Based Semantic Occupancy Forecasting
Riya Mohan, Hurtado Juana Valeria, Rohit Mohan, Abhinav Valada
AI summary
Problem
Existing vision-based occupancy forecasting methods lack semantic detail or rely on error-prone, separate occupancy estimation networks, preventing direct learning of spatio-temporal features from raw images.
Approach
ForecastOcc directly processes past multi-view camera images through a novel temporal cross-attention forecasting module and a 2D-to-3D view transformer to jointly predict voxel-level future occupancy and semantic categories across multiple time horizons.
Key results
- First vision-based framework for direct semantic occupancy forecasting
- Novel temporal cross-attention and 2D-to-3D transformer architecture
- Establishes new multi-view and monocular forecasting benchmarks
- Consistently outperforms adapted 2D baselines across multiple horizons
Why it matters
Enables robust, semantically rich future scene understanding for autonomous driving and robotics without relying on costly LiDAR or error-prone two-stage pipelines.
Abstract
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving. We make the code publicly available at https://forecastocc.cs.uni-freiburg.de
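To make the idea of a temporal cross-attention forecasting module more concrete, the following is a minimal NumPy sketch of generic cross-attention over time, in which one query per future horizon attends over the features of the past frames at each spatial location. It is an illustration under stated assumptions, not the ForecastOcc implementation: the function name `temporal_cross_attention`, the tensor shapes, the use of one query vector per horizon, and the omission of learned key/value projections are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_cross_attention(past_feats, horizon_queries):
    """Illustrative cross-attention over the time axis (hypothetical helper).

    past_feats:      (T, N, C) features of T past frames at N spatial locations.
    horizon_queries: (H, C) one query vector per forecast horizon
                     (an assumption; a real model would learn these).
    Returns future features of shape (H, N, C) and attention weights (H, N, T).
    """
    _, _, channels = past_feats.shape
    # Scaled dot-product score between each horizon query and each past frame,
    # computed independently at every spatial location.
    scores = np.einsum("hc,tnc->hnt", horizon_queries, past_feats)
    scores = scores / np.sqrt(channels)
    # Normalise over the past time steps so the weights sum to 1.
    weights = softmax(scores, axis=-1)
    # Each future feature is a weighted mix of the past features.
    future = np.einsum("hnt,tnc->hnc", weights, past_feats)
    return future, weights


rng = np.random.default_rng(0)
past = rng.normal(size=(4, 6, 8))      # 4 past frames, 6 locations, 8 channels
queries = rng.normal(size=(3, 8))      # 3 forecast horizons
future_feats, attn = temporal_cross_attention(past, queries)
```

In a full pipeline of the kind the abstract describes, features like `future_feats` would still have to pass through a 2D-to-3D view transformer, a 3D encoder, and a semantic occupancy head before yielding voxel-level class predictions; those stages are not sketched here.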