ICRA 2026

VG3T: Visual Geometry Grounded Gaussian Transformer

Junho Kim, Seongwon Lee

AI summary

VG3T achieves state-of-the-art 3D semantic occupancy prediction on nuScenes, improving mIoU by 1.7 percentage points while using 46% fewer Gaussian primitives than the previous best method, by combining early multi-view fusion with density-aware sampling.

3D semantic occupancy, multi-view fusion, 3D Gaussians, autonomous driving, sparse representation, early fusion

Problem

Existing multi-view 3D occupancy methods process camera views independently, causing fragmented geometric representations, and suffer from distance-dependent density bias that over-samples near the camera while under-representing distant objects.
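
The distance dependence can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes an ideal pinhole camera looking at a fronto-parallel surface, with one Gaussian initialized per pixel, and the focal length of 1000 pixels is a hypothetical value chosen for illustration. Under those assumptions one pixel covers a square of side depth / focal length, so the number of pixel-aligned points per square meter falls with the square of the depth.

```python
def points_per_square_meter(depth_m, focal_px):
    """Pixel-aligned point density on a fronto-parallel surface.

    Assumes an ideal pinhole camera: one pixel spans depth_m / focal_px
    meters on the surface, so density is the inverse of that area.
    """
    pixel_side_m = depth_m / focal_px
    return 1.0 / (pixel_side_m ** 2)


# Hypothetical focal length, for illustration only.
FOCAL_PX = 1000.0

near = points_per_square_meter(5.0, FOCAL_PX)
far = points_per_square_meter(50.0, FOCAL_PX)
print(near / far)  # a surface 10x farther away receives 100x fewer points
```

Under these assumptions, a surface at 50 m receives one hundredth of the points that the same surface would receive at 5 m, which is the imbalance the summary refers to.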

Approach

VG3T fuses features from all cameras early using a geometric transformer backbone, then directly predicts a sparse set of 3D Gaussians refined by grid-based sampling and positional adjustment to ensure uniform spatial coverage.
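
The idea of sampling on a grid to even out spatial coverage can be sketched with a generic voxel-grid downsampling routine. This is an illustrative stand-in, not the paper's Grid-Based Sampling module, whose details are not given here: the function name `grid_sample`, the centroid rule, and the cell size are all assumptions.

```python
from collections import defaultdict
from math import floor


def grid_sample(points, cell_size):
    """Keep one representative point (the centroid) per occupied grid cell.

    points: iterable of (x, y, z) tuples in meters.
    cell_size: edge length of a cubic grid cell in meters.
    """
    cells = defaultdict(list)
    for x, y, z in points:
        key = (floor(x / cell_size), floor(y / cell_size), floor(z / cell_size))
        cells[key].append((x, y, z))

    sampled = []
    for key in sorted(cells):  # sorted only to make the output order deterministic
        members = cells[key]
        count = len(members)
        sampled.append(tuple(sum(axis) / count for axis in zip(*members)))
    return sampled


# Ten points crowded into the first meter, plus one isolated distant point.
dense_near = [(0.1 * i, 0.0, 0.0) for i in range(10)]
sparse_far = [(20.0, 0.0, 0.0)]
sampled = grid_sample(dense_near + sparse_far, cell_size=1.0)
print(len(sampled))  # 2: the crowded cell and the distant cell each keep one point
```

After sampling, the crowded near region and the sparse far region contribute one point each, so the density no longer depends on how many pixels happened to land in a cell.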

Key results

1.7 percentage point mIoU improvement over the previous state of the art on the nuScenes benchmark
46% fewer Gaussian primitives than the previous state of the art
Early multi-view feature fusion architecture
Grid-based sampling and positional refinement modules

Why it matters

Suggests that autonomous driving systems can gain accuracy in 3D scene understanding while representing the scene with fewer primitives, which lowers the cost of the representation.

Abstract

Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state of the art on the nuScenes benchmark, highlighting its efficiency and performance. Code is available at https://github.com/junho2000/VG3T.
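
To illustrate how a set of semantically attributed Gaussians can yield a semantic occupancy grid, here is a minimal sketch. It is not the VG3T implementation: it assumes isotropic Gaussians, a simple opacity-weighted sum of class probabilities at each voxel center, and a hypothetical `empty_threshold` below which a voxel is treated as empty. The dictionary keys and the function name are likewise chosen for illustration.

```python
from math import exp


def gaussians_to_occupancy(gaussians, voxel_centers, empty_threshold=0.05):
    """Label each voxel with the class whose accumulated Gaussian weight is largest.

    gaussians: list of dicts with keys
        'mean'    (x, y, z) center in meters,
        'scale'   isotropic standard deviation in meters (simplifying assumption),
        'opacity' float in [0, 1],
        'probs'   list of per-class probabilities.
    Returns one class index per voxel, or -1 for voxels treated as empty.
    """
    if not gaussians:
        return [-1] * len(voxel_centers)

    num_classes = len(gaussians[0]["probs"])
    labels = []
    for vx, vy, vz in voxel_centers:
        scores = [0.0] * num_classes
        for g in gaussians:
            mx, my, mz = g["mean"]
            sq_dist = (vx - mx) ** 2 + (vy - my) ** 2 + (vz - mz) ** 2
            weight = g["opacity"] * exp(-0.5 * sq_dist / g["scale"] ** 2)
            for cls, prob in enumerate(g["probs"]):
                scores[cls] += weight * prob
        best = max(range(num_classes), key=scores.__getitem__)
        labels.append(best if scores[best] >= empty_threshold else -1)
    return labels


# Two Gaussians with different classes, queried at three voxel centers.
scene = [
    {"mean": (0.0, 0.0, 0.0), "scale": 0.5, "opacity": 1.0, "probs": [1.0, 0.0]},
    {"mean": (5.0, 0.0, 0.0), "scale": 0.5, "opacity": 1.0, "probs": [0.0, 1.0]},
]
voxels = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
labels = gaussians_to_occupancy(scene, voxels)
print(labels)  # [0, 1, -1]
```

Because every voxel aggregates over one shared set of Gaussians, a sketch like this has no per-view outputs to reconcile, which is the kind of consistency the joint multi-view prediction aims for.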

Index terms

Deep Learning for Visual Perception; Recognition; Visual Learning