ICRA 2026

PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks

Zhengyu Lin, Zhenhong Guo, Feng Zheng∗

AI summary

PinPoint3D achieves high-accuracy fine-grained 3D part segmentation using only a few user clicks, significantly outperforming existing methods while drastically reducing interaction effort.

3D part segmentation, interactive segmentation, point clouds, embodied AI, few-shot interaction, hierarchical decoding

Problem

Existing interactive 3D segmentation methods focus on coarse instance-level targets and struggle with sparse real-world scans, while non-interactive approaches lack annotated data and perform poorly on noisy point clouds, hindering fine-grained part-level understanding for embodied AI.

Approach

The authors introduce a novel interactive framework that uses a dual-level transformer decoder with targeted attention masking to generate precise object and part masks from sparse point clouds guided by minimal user clicks, supported by a new 3D data synthesis pipeline for training.
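The summary does not say how the attention masking is implemented, so the following is only a generic sketch of masked attention, not the authors' decoder. It assumes, purely for illustration, that a part-level query is allowed to attend only to points flagged by an object-level mask; the function name `masked_attention` and the toy inputs are hypothetical.

```python
import math

def masked_attention(query, keys, values, allowed):
    """Single-query scaled dot-product attention restricted to allowed keys.

    ASSUMPTION: `allowed` stands in for an object-level mask that limits
    which points a part query may attend to. The page does not specify this.
    """
    if not any(allowed):
        raise ValueError("mask excludes every key")
    scale = math.sqrt(len(query))
    # Scores of disallowed keys are set to -inf so they receive zero weight.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else float("-inf")
        for key, ok in zip(keys, allowed)
    ]
    top = max(s for s in scores if s != float("-inf"))
    # Numerically stable softmax over the allowed keys only.
    exps = [math.exp(s - top) if s != float("-inf") else 0.0 for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy example: three points, the third lies outside the object mask.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
allowed = [True, True, False]
output, weights = masked_attention(query, keys, values, allowed)
```

In this toy case the third point contributes nothing to the output, however well its key matches the query, which is the general effect an attention mask has.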

Key results

55.8% average IoU per object part with one click, and over 71.3% with a few additional clicks
Large-scale scene-level dataset with dense part annotations via novel synthesis pipeline
Up to 16% IoU and precision improvement over interactive baselines
Strong cross-domain generalization on MultiScan with reduced click requirements
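For readers unfamiliar with the metric, IoU (intersection over union) for a point-cloud mask can be computed as sketched below. This is the standard definition, not the paper's evaluation code; the helper names `mask_iou` and `mean_part_iou`, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions made for illustration.

```python
def mask_iou(pred, target):
    """IoU between two boolean per-point masks over the same points."""
    if len(pred) != len(target):
        raise ValueError("masks must cover the same points")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # ASSUMPTION: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_part_iou(pairs):
    """Average IoU over (predicted mask, ground-truth mask) pairs, one per part."""
    ious = [mask_iou(pred, target) for pred, target in pairs]
    return sum(ious) / len(ious)

# Toy example with five points: two points overlap, four are in the union.
pred = [True, True, False, False, True]
target = [True, False, False, True, True]
print(mask_iou(pred, target))  # 0.5
```

Averaging `mask_iou` over all evaluated parts, as in `mean_part_iou`, gives a figure of the same kind as the per-part averages reported above.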

Why it matters

Enables embodied AI and robotic systems to interact precisely with complex 3D environments by providing a highly efficient, low-effort interactive segmentation tool for fine-grained part manipulation.

Abstract

Fine-grained 3D part segmentation is crucial for enabling embodied AI systems to perform complex manipulation tasks, such as interacting with specific functional components of an object. However, existing interactive segmentation methods are largely confined to coarse, instance-level targets, while non-interactive approaches struggle with sparse, real-world scans and suffer from a severe lack of annotated data. To address these limitations, we introduce PinPoint3D, a novel interactive framework for fine-grained, multi-granularity 3D segmentation, capable of generating precise part-level masks from only a few user point clicks. A key component of our work is a new 3D data synthesis pipeline that we developed to create a large-scale, scene-level dataset with dense part annotations, overcoming a critical bottleneck that has hindered progress in this field. Through comprehensive experiments, we demonstrate that our method significantly outperforms existing approaches, achieving an average IoU of 55.8% on each object part with only one click and surpassing 71.3% IoU with a few additional click queries. Compared to current state-of-the-art baselines, PinPoint3D yields up to a 16% improvement in IoU and precision, highlighting its effectiveness and high efficiency on challenging, sparse point clouds. Our work represents a significant step towards more nuanced and precise machine perception and interaction in complex 3D environments. Our code, checkpoints and datasets can be found at the project website https://pinpoint3d.online.

Index terms

Deep Learning for Visual Perception; Data Sets for Robotic Vision; Recognition