ICRA 2026

Clutt3R-Seg: Sparse-View 3D Instance Segmentation for Language-Grounded Grasping in Cluttered Scenes

Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim

AI summary

Clutt3R-Seg enables robust, zero-shot 3D instance segmentation and language-grounded multi-stage grasping in heavily cluttered, sparse-view scenes by organizing noisy masks into a hierarchical tree and correcting them via cross-view grouping and conditional substitution.

Sparse-view segmentation, 3D instance segmentation, language-grounded grasping, hierarchical mask grouping, cluttered scenes, zero-shot robotics

Problem

3D instance segmentation becomes unreliable in cluttered environments: heavy occlusion, limited viewpoints, and noisy masks cause over- and under-segmentation, which breaks cross-view consistency and hinders language-grounded robotic grasping.

Approach

The method builds a hierarchical instance tree from noisy 2D masks to resolve segmentation errors, then groups masks across views using spatial and semantic similarity before applying conditional parent substitution to yield view-consistent 3D instances enriched with open-vocabulary embeddings.
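
To make the idea of a hierarchical instance tree concrete, the following is a minimal sketch for a single view. It is not the paper's algorithm: the containment rule, the threshold `tau`, and the function names `containment` and `build_mask_tree` are assumptions chosen for illustration. Each mask is attached to the smallest larger mask that covers most of its pixels, so part-level (over-segmented) masks end up as children of whole-object masks.

```python
import numpy as np


def containment(child: np.ndarray, parent: np.ndarray) -> float:
    """Fraction of the child's pixels that also lie inside the parent."""
    area = child.sum()
    if area == 0:
        return 0.0
    return float(np.logical_and(child, parent).sum() / area)


def build_mask_tree(masks: list[np.ndarray], tau: float = 0.9) -> list[int]:
    """Return parent[i] for each boolean mask; -1 marks a root.

    Hypothetical rule (an assumption, not taken from the paper): the parent
    of a mask is the smallest strictly larger mask that covers at least
    `tau` of its pixels.
    """
    areas = [int(m.sum()) for m in masks]
    parent = [-1] * len(masks)
    for i, mask_i in enumerate(masks):
        best_area = None
        for j, mask_j in enumerate(masks):
            if j == i or areas[j] <= areas[i]:
                continue  # only strictly larger masks can be parents
            if containment(mask_i, mask_j) >= tau:
                if best_area is None or areas[j] < best_area:
                    parent[i] = j
                    best_area = areas[j]
    return parent


# Toy view: one whole-object mask, two part masks inside it, one unrelated mask.
whole = np.zeros((8, 8), dtype=bool)
whole[1:7, 1:7] = True
left = np.zeros((8, 8), dtype=bool)
left[1:7, 1:4] = True
right = np.zeros((8, 8), dtype=bool)
right[1:7, 4:7] = True
other = np.zeros((8, 8), dtype=bool)
other[7:8, 7:8] = True

parents = build_mask_tree([whole, left, right, other])
```

In this toy case the two part masks point to the whole-object mask and the unrelated mask stays a root. A substitution step like the one described above could then decide, per group, whether to keep the children or replace them with their parent.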

Key results

61.66 AP@25 on the most challenging heavy-clutter sequences, over 2.2× higher than baselines
With only four input views, surpasses MaskClustering with eight views by more than 2×
Enables zero-shot language-grounded target identification and multi-stage grasping
Maintains segmentation consistency after object displacement using only a single post-interaction image
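
For readers unfamiliar with the metric, AP@25 counts a predicted instance as correct when its IoU with a ground-truth instance is at least 0.25. The sketch below shows only that matching step, with instances represented as sets of point indices. The greedy one-to-one matching and the names `iou` and `match_at_threshold` are illustrative assumptions, not the paper's evaluation code; a full AP computation would additionally rank predictions by confidence and integrate the precision-recall curve.

```python
def iou(pred: set[int], gt: set[int]) -> float:
    """Intersection over union of two sets of point indices."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def match_at_threshold(
    preds: list[set[int]], gts: list[set[int]], thr: float = 0.25
) -> list[bool]:
    """Flag each prediction as a true positive (True) or false positive (False).

    Assumes `preds` is already sorted by descending confidence. Each
    ground-truth instance can be matched at most once.
    """
    used: set[int] = set()
    flags: list[bool] = []
    for pred in preds:
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in used:
                continue
            value = iou(pred, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            used.add(best_j)
        flags.append(best_j >= 0)
    return flags


gts = [{0, 1, 2, 3}, {10, 11, 12, 13}]
preds = [
    {0, 1, 2, 3, 4, 5},              # IoU 4/6 with the first ground truth
    {13, 20, 21, 22, 23, 24, 25},    # IoU 1/10 with the second ground truth
    {0, 1},                          # overlaps a ground truth that is already matched
]
flags = match_at_threshold(preds, gts)
```

Here only the first prediction clears the 0.25 threshold; the third is a false positive because its only overlapping ground truth was already claimed.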

Why it matters

It provides a robust perception pipeline that adapts to scene changes without rescanning, allowing robots to reliably identify and manipulate specific objects in complex, dynamic environments from natural language commands.

Abstract

Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2× higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2×. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
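
As an illustration of how open-vocabulary embeddings can support target selection from a language instruction, here is a minimal sketch that picks the instance whose embedding has the highest cosine similarity to a text embedding. It assumes both embeddings already live in a shared space and are given as NumPy arrays; the function name `select_target` and the toy vectors are hypothetical, and the paper's actual selection procedure may differ.

```python
import numpy as np


def select_target(
    text_emb: np.ndarray, instance_embs: np.ndarray
) -> tuple[int, np.ndarray]:
    """Return the index of the best-matching instance and all cosine scores.

    text_emb has shape (d,) and instance_embs has shape (n, d).
    All embeddings are assumed to be non-zero.
    """
    text_unit = text_emb / np.linalg.norm(text_emb)
    inst_unit = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    scores = inst_unit @ text_unit  # cosine similarity per instance
    return int(np.argmax(scores)), scores


# Toy 3-dimensional embeddings for three instances and one instruction.
instance_embs = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.6, 0.8, 0.0],
    ]
)
text_emb = np.array([0.0, 2.0, 0.0])

best, scores = select_target(text_emb, instance_embs)
```

Normalizing both sides makes the score independent of embedding magnitude, so the instruction vector above matches the second instance even though it is not unit length.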

Index terms

Perception for Grasping and Manipulation; Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception