ICRA 2026

GaussianFormer3D: Multi-Modal Gaussian-Based Semantic Occupancy Prediction with 3D Deformable Attention

Lingjun Zhao, Sizhe Wei, James Hays, Lu Gan

AI summary

GaussianFormer3D fuses LiDAR and camera data through a 3D Gaussian scene representation refined by 3D deformable attention, achieving state-of-the-art 3D semantic occupancy prediction with reduced memory consumption and improved efficiency.

3D semantic occupancy, LiDAR-camera fusion, 3D Gaussians, deformable attention, autonomous driving, scene representation

Problem

Current multi-modal occupancy prediction relies on dense 3D voxel grids, which are computationally expensive and spend much of their capacity on empty space. Existing Gaussian-based methods, in turn, rely solely on 2D camera data, which limits the accuracy of their 3D geometry and depth estimates.

Approach

The authors propose GaussianFormer3D, which initializes 3D Gaussians with LiDAR-derived geometry priors and refines them using a LiDAR-guided 3D deformable attention mechanism that aggregates fused LiDAR-camera features in a unified 3D space.
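
The initialization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `voxel_to_gaussians`, the `Gaussian` fields, the default voxel size, and the choice to rank voxels by point count are all assumptions made for illustration. The idea shown is only that occupied LiDAR voxels, rather than random positions, supply the starting means of the Gaussians.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Gaussian:
    # Hypothetical parameterization, assumed for this sketch.
    mean: tuple      # (x, y, z) center in metres
    scale: tuple     # per-axis extent
    rotation: tuple  # unit quaternion (w, x, y, z)
    opacity: float


def voxel_to_gaussians(points, voxel_size=0.5, max_gaussians=None):
    """Create one Gaussian per occupied voxel of a LiDAR point cloud."""
    # Group points by the integer index of the voxel they fall into.
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    # Assumption: keep the densest voxels first when a budget is given.
    ranked = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    if max_gaussians is not None:
        ranked = ranked[:max_gaussians]

    gaussians = []
    for _, pts in ranked:
        n = len(pts)
        # The mean of the points in the voxel becomes the Gaussian center.
        mean = tuple(sum(p[i] for p in pts) / n for i in range(3))
        gaussians.append(
            Gaussian(
                mean=mean,
                scale=(voxel_size / 2,) * 3,      # isotropic starting extent
                rotation=(1.0, 0.0, 0.0, 0.0),    # identity rotation
                opacity=1.0,
            )
        )
    return gaussians


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (2.0, 2.0, 2.0)]
    for g in voxel_to_gaussians(cloud, voxel_size=0.5):
        print(g)
```

Only occupied voxels produce Gaussians, so empty space costs nothing in this representation, which is the compactness argument made against dense voxel grids.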

Key results

State-of-the-art performance on nuScenes-SurroundOcc and Occ3D datasets
Substantial accuracy gains on small objects and large surfaces
Reduced memory consumption and improved inference efficiency
Strong generalization to off-road environments with single-frame input

Why it matters

Enables more accurate, efficient, and robust 3D scene understanding for autonomous driving and robotic navigation by leveraging complementary LiDAR-camera data through a compact Gaussian representation.

Abstract

3D semantic occupancy prediction is essential for achieving safe, reliable autonomous driving and robotic navigation. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and fine-grained predictions. Although voxel-based scene representations are widely used for semantic occupancy prediction, 3D Gaussians have emerged as a continuous and significantly more compact alternative. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, namely GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy that provides 3D Gaussians with accurate geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism to refine these Gaussians using LiDAR-camera fusion features in a lifted 3D space. Extensive experiments on real-world on-road and off-road autonomous driving datasets demonstrate that GaussianFormer3D achieves state-of-the-art prediction performance with reduced memory consumption and improved efficiency. Project website: https://lunarlab-gatech.github.io/GaussianFormer3D/.
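
The core operation behind deformable attention in a 3D feature space can be sketched in a few lines. This is a generic, simplified illustration under stated assumptions, not the paper's LiDAR-guided mechanism: the feature volume holds a single scalar per cell, and the sampling offsets and attention logits are passed in directly, whereas a real model would predict them from each Gaussian's query features. All function names are hypothetical.

```python
import math


def trilinear_sample(volume, x, y, z):
    """Interpolate volume[ix][iy][iz] at continuous index coordinates."""
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    # Clamp to the grid so out-of-range offsets read the border value.
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    x0, y0, z0 = int(math.floor(x)), int(math.floor(y)), int(math.floor(z))
    x1, y1, z1 = min(x0 + 1, nx - 1), min(y0 + 1, ny - 1), min(z0 + 1, nz - 1)
    fx, fy, fz = x - x0, y - y0, z - z0

    value = 0.0
    for ix, wx in ((x0, 1.0 - fx), (x1, fx)):
        for iy, wy in ((y0, 1.0 - fy), (y1, fy)):
            for iz, wz in ((z0, 1.0 - fz), (z1, fz)):
                value += wx * wy * wz * volume[ix][iy][iz]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention_3d(volume, reference, offsets, logits):
    """Weighted sum of features sampled at reference + offset positions."""
    weights = softmax(logits)
    rx, ry, rz = reference
    return sum(
        w * trilinear_sample(volume, rx + dx, ry + dy, rz + dz)
        for w, (dx, dy, dz) in zip(weights, offsets)
    )


if __name__ == "__main__":
    # Toy 2x2x2 volume whose value equals the x index of the cell.
    vol = [[[float(ix) for _ in range(2)] for _ in range(2)] for ix in range(2)]
    out = deformable_attention_3d(
        vol,
        reference=(0.5, 0.5, 0.5),
        offsets=[(-0.5, 0.0, 0.0), (0.5, 0.0, 0.0)],
        logits=[0.0, 0.0],
    )
    print(out)
```

Each query reads only a handful of sampled locations around its reference point instead of attending to every cell, which is what keeps this style of attention affordable in a 3D feature space.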

Index terms

Deep Learning for Visual Perception, Computer Vision for Transportation, Sensor Fusion