ICRA 2026

SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding

Sheng Ye, Zhen-Hui Dong, Ruoyu Fan, Tian Lv, Yong-Jin Liu

AI summary

SemGS enables rapid, generalizable semantic 3D scene reconstruction and novel-view synthesis from only a few input images without per-scene optimization.

Semantic 3D Gaussian Splatting, Feed-forward reconstruction, Sparse-view synthesis, Generalizable scene understanding, Camera-aware attention, Dual-Gaussian representation

Problem

Existing semantic scene reconstruction methods rely on dense multi-view inputs and require slow, scene-specific optimization, limiting their scalability and real-world applicability.

Approach

SemGS uses a feed-forward dual-branch network with camera-aware attention to extract color and semantic features from sparse views, decodes them into shared-geometry dual Gaussians, and rasterizes them to render semantic maps in a single pass.
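One way to picture the "shared-geometry dual Gaussians" idea is as two records that point at the same geometry object while carrying different payloads. The sketch below is only an illustration of that data layout, not the paper's implementation: the class names, field names, and the choice to treat opacity as a branch-specific attribute are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]


@dataclass(frozen=True)
class SharedGeometry:
    """Geometry common to both branches (hypothetical field names)."""
    mean: Vec3       # 3D center of the Gaussian
    scale: Vec3      # per-axis extent
    rotation: Quat   # unit quaternion (w, x, y, z)


@dataclass
class ColorGaussian:
    """Color-branch Gaussian: shared geometry plus appearance attributes."""
    geometry: SharedGeometry
    opacity: float   # assumption: opacity is kept per branch
    rgb: Vec3


@dataclass
class SemanticGaussian:
    """Semantic-branch Gaussian: shared geometry plus class scores."""
    geometry: SharedGeometry
    opacity: float   # assumption: opacity is kept per branch
    class_logits: List[float]


def make_dual_gaussian(
    geometry: SharedGeometry,
    color_opacity: float,
    rgb: Vec3,
    semantic_opacity: float,
    class_logits: Sequence[float],
) -> Tuple[ColorGaussian, SemanticGaussian]:
    """Build one color and one semantic Gaussian over the same geometry."""
    color = ColorGaussian(geometry, color_opacity, rgb)
    semantic = SemanticGaussian(geometry, semantic_opacity, list(class_logits))
    return color, semantic


geometry = SharedGeometry(
    mean=(0.0, 0.0, 1.0),
    scale=(0.1, 0.1, 0.1),
    rotation=(1.0, 0.0, 0.0, 0.0),
)
color_g, semantic_g = make_dual_gaussian(
    geometry, 0.9, (1.0, 0.0, 0.0), 0.8, [0.1, 2.0]
)
```

Because both records reference the same `SharedGeometry` instance, a rasterizer would project the color and semantic Gaussians to identical screen-space footprints, so the rendered semantic map stays aligned with the rendered image.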

Key results

State-of-the-art mIoU and accuracy on ScanNet and ScanNet++ benchmarks
Rapid inference speeds exceeding 6 FPS without per-scene optimization
Strong cross-domain generalization to synthetic and real-world scenes
Regional smoothness loss improves semantic coherence and boundary sharpness
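For readers unfamiliar with the metrics named above, the following is a minimal sketch of how mIoU and pixel accuracy are commonly computed from predicted and ground-truth label maps. It uses the standard textbook definitions; the paper's exact evaluation protocol (class set, handling of unlabeled pixels) is not given on this page, so the `ignore_label` convention and the function names here are assumptions.

```python
from typing import List, Sequence, Tuple


def confusion_counts(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: int = -1,
) -> List[List[int]]:
    """Return matrix[t][p]: number of pixels with true class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_label:  # skip unlabeled pixels (assumed convention)
            continue
        matrix[t][p] += 1
    return matrix


def miou_and_accuracy(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: int = -1,
) -> Tuple[float, float]:
    """Mean IoU over classes that occur, and overall pixel accuracy."""
    m = confusion_counts(pred, target, num_classes, ignore_label)
    total = sum(sum(row) for row in m)
    correct = sum(m[c][c] for c in range(num_classes))

    ious = []
    for c in range(num_classes):
        tp = m[c][c]
        fn = sum(m[c]) - tp
        fp = sum(m[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:  # classes absent from both maps are not averaged
            ious.append(tp / union)

    miou = sum(ious) / len(ious) if ious else 0.0
    accuracy = correct / total if total else 0.0
    return miou, accuracy


# Flattened toy label maps with two classes.
miou, accuracy = miou_and_accuracy([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, while three of four pixels are correct.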

Why it matters

Gives robots and vision systems a scalable, fast way to obtain high-level semantic understanding of unknown environments, without per-scene optimization.

Abstract

Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
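The abstract does not define the regional smoothness loss, so the following is only a generic illustration of what a region-wise smoothness penalty can look like, not the authors' formulation. It assumes each pixel has a class-probability vector and a region id (for example from some segmentation mask; the source of regions is hypothetical here), and it penalizes deviation of each pixel's prediction from the mean prediction of its region.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def regional_smoothness(
    probs: Sequence[Sequence[float]],
    region_ids: Sequence[int],
) -> float:
    """Mean squared deviation of per-pixel class probabilities
    from the mean probabilities of the region each pixel belongs to."""
    if len(probs) != len(region_ids):
        raise ValueError("probs and region_ids must have the same length")
    if not probs:
        return 0.0

    num_classes = len(probs[0])
    sums: Dict[int, List[float]] = defaultdict(lambda: [0.0] * num_classes)
    counts: Dict[int, int] = defaultdict(int)

    # First pass: accumulate per-region sums of the probability vectors.
    for p, r in zip(probs, region_ids):
        counts[r] += 1
        acc = sums[r]
        for k in range(num_classes):
            acc[k] += p[k]

    # Second pass: squared distance of each pixel to its region mean.
    total = 0.0
    for p, r in zip(probs, region_ids):
        n = counts[r]
        total += sum((p[k] - sums[r][k] / n) ** 2 for k in range(num_classes))
    return total / len(probs)


# Two pixels that disagree inside one region are penalized ...
same_region = regional_smoothness([[1.0, 0.0], [0.0, 1.0]], [0, 0])
# ... while the same predictions in separate regions are not.
split_regions = regional_smoothness([[1.0, 0.0], [0.0, 1.0]], [0, 1])
```

A penalty of this shape encourages consistent labels inside a region while leaving predictions free to change across region borders, which is one plausible reading of "semantic coherence"; the actual loss in the paper may differ in both form and inputs.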

Index terms

Semantic Scene Understanding, Deep Learning for Visual Perception, Deep Learning Methods