ICRA 2026

Search3D: Hierarchical Open-Vocabulary 3D Segmentation

Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, Federico Tombari

AI summary

Search3D enables flexible, multi-granularity 3D scene search by building a hierarchical tree of objects and parts with open-vocabulary features, outperforming baselines in segmenting objects, parts, and materials.

Open-vocabulary 3D segmentation, hierarchical scene representation, object part segmentation, vision-language models, 3D scene understanding, robotic perception

Problem

Existing open-vocabulary 3D segmentation methods are limited to either object-level instances or noisy, memory-intensive point-level features, failing to robustly segment finer-grained scene entities like object parts or attribute-based regions.

Approach

Search3D constructs a hierarchical tree representation of 3D scenes by combining class-agnostic object masks with geometric over-segmentation for parts, then enriches both levels with pixel-aligned SigLIP features fused across multiple views to enable open-vocabulary text queries.
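
The tree-plus-features idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `fuse_views`, and `search` are hypothetical, averaging normalized per-view features is an assumed fusion rule, and the toy 3-dimensional vectors stand in for real SigLIP image and text embeddings.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Optional


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def fuse_views(view_features):
    """Fuse per-view features by averaging unit vectors (an assumed fusion rule)."""
    unit = [normalize(f) for f in view_features]
    dim = len(unit[0])
    mean = [sum(f[i] for f in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


@dataclass
class Node:
    name: str                        # debugging label only; real masks are class-agnostic
    level: str                       # "scene", "object", or "part"
    feature: Optional[list] = None   # fused open-vocabulary feature, if any
    children: list = field(default_factory=list)


def search(root, text_feature, level=None):
    """Rank nodes by cosine similarity to a text embedding, optionally at one level."""
    q = normalize(text_feature)
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.feature is None or (level and node.level != level):
            continue
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = sum(a * b for a, b in zip(q, node.feature))
        hits.append((score, node.name))
    return sorted(hits, reverse=True)


# Toy scene: one object (a chair) with two parts, each seen in one or two views.
chair_seat = Node("seat", "part", fuse_views([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]))
chair_leg = Node("leg", "part", fuse_views([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]))
chair = Node("chair", "object", fuse_views([[0.6, 0.6, 0.0]]), [chair_seat, chair_leg])
scene = Node("scene", "scene", None, [chair])

# A hypothetical text embedding that points along the "seat" direction.
ranked = search(scene, [1.0, 0.0, 0.0], level="part")
```

Restricting `level` to `"part"` or `"object"` mimics querying at one granularity, while leaving it unset ranks objects and parts together, which is one plausible way to support attribute-style queries that do not map to a single object.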

Key results

Hierarchical open-vocabulary 3D segmentation method for objects and parts
Scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan
Open-vocabulary hierarchical part annotations for ScanNet++ scenes
Superior performance over baselines in object, part, and material segmentation

Why it matters

Enables robots and assistive systems to interact with complex 3D environments by understanding and querying fine-grained scene elements beyond basic object boundaries.

Abstract

Open-vocabulary 3D segmentation enables exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances but struggle with finer-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm, moving beyond explicit object-centric queries. For systematic evaluation, we further contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. Search3D outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is search3d-segmentation.github.io.

Index terms

Semantic Scene Understanding; Object Detection, Segmentation and Categorization; RGB-D Perception