ICRA 2026

Semantically Consistent Language Gaussian Splatting for 3D Point-Level Open-Vocabulary Querying

Hairong Yin, Huangying Zhan, Yi Xu, Raymond Yeh

AI summary

A tracking-based distillation and ground-truth anchored querying framework significantly improves semantic consistency and accuracy for open-vocabulary 3D object retrieval.

Open-vocabulary querying, 3D Gaussian Splatting, Semantic consistency, Ground-truth anchoring, Robotics perception, Language-guided segmentation

Problem

Existing 3D Gaussian Splatting methods suffer from inconsistent cross-frame supervision during language embedding distillation and rely on fixed similarity thresholds that fail across diverse queries, hindering reliable point-level retrieval for robotics.

Approach

The method uses SAM2 tracking to aggregate consistent ground-truth language features across frames for training, then introduces a ground-truth anchored querying step that dynamically calibrates retrieval thresholds relative to these consistent features.
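The two-stage query described above can be illustrated with a minimal sketch. It assumes that distillation has already produced one language feature per tracked object (`gt_feats`) and one per Gaussian (`gaussian_feats`), and that the text query is embedded in the same space. The function name, the array shapes, and the nearest-anchor selection rule are all hypothetical: this summary does not state the paper's calibration rule, so nearest-anchor assignment stands in for it here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def gt_anchored_query(text_emb, gt_feats, gaussian_feats):
    """Hypothetical two-stage query (not the authors' implementation).

    text_emb:       (D,)   embedding of the text query
    gt_feats:       (K, D) one distilled ground-truth feature per tracked object
    gaussian_feats: (N, D) one language feature per Gaussian
    Returns the index of the retrieved ground-truth feature and a boolean
    mask over the N Gaussians.
    """
    text_emb = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    gt_feats = l2_normalize(np.asarray(gt_feats, dtype=np.float64))
    gaussian_feats = l2_normalize(np.asarray(gaussian_feats, dtype=np.float64))

    # Stage 1: retrieve the ground-truth feature closest to the text query
    # (cosine similarity, since all vectors are unit length).
    anchor_idx = int(np.argmax(gt_feats @ text_emb))

    # Stage 2: query the Gaussians with the retrieved feature instead of the
    # raw text embedding. Assumed rule: keep a Gaussian if the anchor is its
    # most similar ground-truth feature, which avoids a fixed threshold.
    sims = gaussian_feats @ gt_feats.T  # (N, K) cosine similarities
    mask = np.argmax(sims, axis=1) == anchor_idx
    return anchor_idx, mask
```

In this sketch the selection depends only on how each Gaussian compares with the set of ground-truth features, so no per-query similarity threshold has to be tuned by hand.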

Key results

Tracking-based distillation generates semantically consistent 3D ground-truth supervision
GT-anchored querying dynamically calibrates similarity thresholds using retrieved ground-truth features
Outperforms state-of-the-art methods across LERF, 3D-OVS, and Replica benchmarks
Achieves mIoU improvements of +4.14 on LERF, +20.42 on 3D-OVS, and +1.74 on Replica
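For readers unfamiliar with the metric, mean intersection-over-union (mIoU) can be computed as in the sketch below. This is the generic definition over boolean masks. The function name is hypothetical, and the exact evaluation protocol used in the paper (how queries and scenes are averaged, and how empty masks are handled) is not given in this summary.

```python
import numpy as np


def mean_iou(pred_masks, gt_masks):
    """Average IoU over paired boolean masks (generic definition).

    Pairs in which both masks are empty are skipped; this is an assumption,
    and evaluation protocols differ on that case.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            continue  # IoU is undefined when both masks are empty
        inter = np.logical_and(pred, gt).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

For point-level querying, each mask would hold one boolean per 3D point or Gaussian rather than one per pixel.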

Why it matters

Delivers a robust, point-level querying pipeline that enables reliable open-vocabulary scene understanding for downstream robotic manipulation and navigation tasks.

Abstract

Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation, human-robot interaction, and autonomous navigation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision and lack a robust 3D point-level retrieval mechanism. In this work, (i) we present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth for distilling the language Gaussians; (ii) we introduce a GT-anchored querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods. Our method achieves an mIoU improvement of +4.14, +20.42, and +1.7 on the LERF, 3D-OVS, and Replica datasets, respectively. These results validate our framework as a promising step toward open-vocabulary understanding in real-world robotic systems.

Index terms

Semantic Scene Understanding; Object Detection, Segmentation and Categorization