ICRA 2026

GaussianVLM: Scene-Centric 3D Vision-Language Models Using Language-Aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool

AI summary

GaussianVLM achieves state-of-the-art 3D scene understanding by embedding language features directly into Gaussian splats and using a dual sparsifier to efficiently process dense spatial data without object detectors.

3D Vision-Language Models, Gaussian Splatting, Scene-Centric Reasoning, Dual Sparsification, Embodied AI, Detector-Free Understanding

Problem

Current 3D vision-language models rely heavily on object detectors, creating processing bottlenecks and limiting their ability to capture global spatial context. They also struggle to efficiently process the dense, high-dimensional representations required for fine-grained scene understanding.

Approach

The model directly embeds language-aligned features into each 3D Gaussian primitive, then uses a dual sparsifier with task-guided and location-guided pathways to distill dense features into compact, task-relevant tokens for an LLM.
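As a rough illustration of the two pathways, the sketch below selects Gaussians in two ways: by similarity to a task embedding (global tokens) and by distance to a query location (local tokens). It is a minimal toy under stated assumptions, not the paper's implementation: the function names, the top-k cosine selection, and the fixed-radius neighborhood are all hypothetical stand-ins for what is presumably a learned sparsifier.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def task_guided_ids(features, task_embedding, k):
    """Assumed global pathway: indices of the k Gaussians whose
    language features are most similar to the task embedding."""
    ranked = sorted(
        range(len(features)),
        key=lambda i: cosine(features[i], task_embedding),
        reverse=True,
    )
    return ranked[:k]


def location_guided_ids(positions, query_xyz, radius):
    """Assumed local pathway: indices of Gaussians whose centers
    lie within `radius` of the query location."""
    return [i for i, p in enumerate(positions) if math.dist(p, query_xyz) <= radius]


def dual_sparsify(positions, features, task_embedding, query_xyz, k=2, radius=1.0):
    """Combine both pathways into sparse global and local token sets."""
    return {
        "global_ids": task_guided_ids(features, task_embedding, k),
        "local_ids": location_guided_ids(positions, query_xyz, radius),
    }


# Toy scene: four Gaussians with 3D centers and 2D language features.
positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 3.0, 0.0)]
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

tokens = dual_sparsify(
    positions,
    features,
    task_embedding=(0.0, 1.0),
    query_xyz=(0.0, 0.0, 0.0),
)
print(tokens)
```

In this toy scene the task-guided pathway keeps Gaussians 2 and 3, whose features align with the task embedding, while the location-guided pathway keeps Gaussians 0 and 1, which sit near the query point. A real system would pass the selected features, not just their indices, to the LLM as tokens.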

Key results

First detector-free 3D VLM operating directly on Gaussian splats
Fivefold performance improvement over a prior point-cloud 3D VLM (LL3DA) in out-of-domain settings
State-of-the-art performance across scene-centric and object-centric benchmarks
Dual sparsification mechanism efficiently reduces dense scene tokens to task-aware representations

Why it matters

Provides a scalable, detector-free foundation for embodied agents and spatial AI to reason holistically over complex 3D environments.

Abstract

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images and demonstrating strong generalization: it improves the performance of a prior 3D VLM (LL3DA [8]) fivefold in out-of-domain settings.

Manuscript received: June 7, 2025; Revised: August 25, 2025; Accepted: September 25, 2025. This paper was recommended for publication by Editor Markus Vincze upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by the Ministry of Education and Science of Bulgaria (support for INSAIT, part of the Bulgarian National Roadmap for Research Infrastructure).

1 First, Second, Fourth and Fifth Author are with INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria (name.surname@insait.ai).
2 Third Author is with INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria, ETH Zurich, Switzerland, and TU Munich, Germany (name.surname@inf.ethz.ch).

Index terms

Semantic Scene Understanding, AI-Based Methods, Deep Learning for Visual Perception