ICRA 2026

ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

Xiaoyang Yan, Muleilan Pei, Shaojie Shen

AI summary

ST-GS significantly improves both spatial interaction and temporal consistency in Gaussian-based 3D occupancy prediction, achieving state-of-the-art results on nuScenes.

3D semantic occupancy, Gaussian splatting, spatial-temporal modeling, autonomous driving, multi-view fusion, temporal consistency

Problem

Existing Gaussian-based occupancy prediction methods lack effective multi-view spatial interaction and struggle to maintain temporal consistency across frames, leading to discontinuous and inaccurate scene reconstructions in dynamic driving environments.

Approach

The authors introduce a dual-mode attention mechanism that fuses Gaussian-guided and view-guided sampling to enhance spatial feature aggregation, combined with a geometry-aware temporal fusion module that selectively integrates historical context using ego-motion alignment.
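To make the ego-motion alignment step concrete, here is a minimal sketch of moving Gaussian centers from the previous ego frame into the current one before any temporal fusion. It is not the authors' implementation. It assumes that Gaussian means are plain 3D points and that each frame comes with a 4x4 ego-to-global pose. The function names (`invert_rigid`, `apply_rigid`, `align_to_current`, `yaw_pose`) are hypothetical and used only for illustration. The selective, geometry-aware weighting described above is not modeled here.

```python
import math


def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t] given as nested lists."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    # For a rigid transform, the inverse rotation is the transpose.
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    # The inverse translation is -R^T t.
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply_rigid(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return [sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3)]


def align_to_current(means_prev, pose_prev, pose_curr):
    """Express previous-frame points in the current ego frame.

    pose_prev and pose_curr are assumed to be ego-to-global transforms.
    """
    global_to_curr = invert_rigid(pose_curr)
    return [
        apply_rigid(global_to_curr, apply_rigid(pose_prev, p))
        for p in means_prev
    ]


def yaw_pose(yaw, x, y):
    """Build a planar ego-to-global pose (hypothetical helper for the demo)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


# The ego vehicle drives 2 m forward between frames; a static point that was
# 10 m ahead in the previous frame should now be 8 m ahead.
aligned = align_to_current(
    [[10.0, 0.0, 0.0]], yaw_pose(0.0, 0.0, 0.0), yaw_pose(0.0, 2.0, 0.0)
)
print(aligned)
```

Once historical Gaussians are expressed in the current ego frame, their features can be compared or fused with current-frame features at matching 3D locations. How that fusion is gated is specific to the paper and is left out of this sketch.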

Key results

Achieves state-of-the-art performance on the nuScenes benchmark
Delivers markedly improved temporal consistency over prior Gaussian methods
Introduces a guidance-informed spatial aggregation strategy with dual-mode attention
Designs a geometry-aware temporal fusion scheme for robust multi-frame feature alignment

Why it matters

Provides a more reliable and temporally stable 3D scene understanding pipeline for vision-centric autonomous driving systems operating in complex, dynamic environments.

Abstract

3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent advances have explored utilizing 3D semantic Gaussians to model occupancy while reducing computational overhead, but they remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, in this paper, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework to enhance both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that effectively leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark showcase that our proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency compared to existing Gaussian-based methods.

Index terms

Deep Learning for Visual Perception; Computer Vision for Transportation; Semantic Scene Understanding