ICRA 2026

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

Jingkun Feng, Reza Sabzevari

AI summary

T-FunS3D enables robots to efficiently segment task-relevant functional parts in 3D scenes without training, balancing accuracy with reduced computational overhead.

Open-vocabulary segmentation, 3D functionality, Task-driven perception, Scene graph, Robotics, Vision-language models

Problem

Current open-vocabulary 3D segmentation methods either focus on object-level recognition or exhaustively segment entire scenes, making them too resource-intensive for robotic applications that require fine-grained, task-specific functional parts.

Approach

The method builds a lightweight open-vocabulary scene graph from 3D point clouds and RGB-D images, then uses an LLM to parse task descriptions and a vision-language model to hierarchically segment only the relevant functional components of identified objects.
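
The coarse-to-fine control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Instance`, `ParsedTask`, `segment_for_task`, `toy_parse`, and `toy_segment` are hypothetical, the LLM and vision-language model are replaced by toy stand-in functions, and exact label matching stands in for open-vocabulary matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Instance:
    """One object instance taken from the scene graph (hypothetical fields)."""
    label: str
    points: List[Point]


@dataclass
class ParsedTask:
    """Hypothetical output of the task-parsing step."""
    context_object: str
    functional_part: str


def segment_for_task(
    task: str,
    instances: Sequence[Instance],
    parse_task: Callable[[str], ParsedTask],
    segment_part: Callable[[Instance, str], List[Point]],
) -> List[Point]:
    """Run the fine segmentation step only on task-relevant instances."""
    parsed = parse_task(task)
    # Coarse step: keep only instances that match the contextual object.
    # Exact label matching is a simplification made for this sketch.
    relevant = [i for i in instances if i.label == parsed.context_object]
    # Fine step: segment the functional part inside each relevant instance.
    part_points: List[Point] = []
    for inst in relevant:
        part_points.extend(segment_part(inst, parsed.functional_part))
    return part_points


def toy_parse(task: str) -> ParsedTask:
    # Stand-in for the LLM call: a fixed answer, for demonstration only.
    return ParsedTask(context_object="cabinet", functional_part="handle")


def toy_segment(inst: Instance, part: str) -> List[Point]:
    # Stand-in for the vision-language model: keep points above z = 0.5.
    return [p for p in inst.points if p[2] > 0.5]


scene = [
    Instance("cabinet", [(0.0, 0.0, 0.2), (0.0, 0.0, 0.8)]),
    Instance("chair", [(1.0, 1.0, 0.9)]),
]
result = segment_for_task("open the cabinet", scene, toy_parse, toy_segment)
print(result)
```

In this toy scene only the cabinet reaches the fine step, so the chair's points are never examined. That is the kind of saving a task-driven pipeline aims for compared with exhaustive scene-wide part segmentation.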

Key results

Training-free hierarchical segmentation pipeline
Open-vocabulary scene graph with visual embedding nodes and edges
Task-driven localization of contextual objects and functional parts
Competitive accuracy on SceneFun3D with reduced runtime and memory
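
To make the scene-graph idea concrete, here is a small sketch of a graph whose nodes and edges carry embeddings, with task-relevant nodes retrieved by cosine similarity. Everything in it is an assumption for illustration: the `SceneGraph`, `cosine`, and `top_k` names, the three-dimensional toy vectors, and the use of cosine similarity as the relevance score are not taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: embeddings on nodes and on edges (hypothetical layout)."""
    nodes: Dict[int, List[float]] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)

    def add_node(self, node_id: int, embedding: Sequence[float]) -> None:
        self.nodes[node_id] = list(embedding)

    def add_edge(self, a: int, b: int, embedding: Sequence[float]) -> None:
        # Both endpoints must already exist in the graph.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(a, b)] = list(embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(graph: SceneGraph, query: Sequence[float], k: int = 1) -> List[int]:
    """Return the ids of the k nodes most similar to the query embedding."""
    ranked = sorted(
        graph.nodes,
        key=lambda node_id: cosine(graph.nodes[node_id], query),
        reverse=True,
    )
    return ranked[:k]


graph = SceneGraph()
graph.add_node(0, [1.0, 0.0, 0.0])
graph.add_node(1, [0.0, 1.0, 0.0])
graph.add_node(2, [0.9, 0.1, 0.0])
graph.add_edge(0, 2, [0.5, 0.5, 0.0])

# The query plays the role of an embedded task description.
best = top_k(graph, [1.0, 0.0, 0.0], k=2)
print(best)
```

In a real system the vectors would come from a vision-language encoder and the query from the embedded task description; the toy vectors here only show the retrieval mechanics.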

Why it matters

Enables real-world robotic systems to efficiently perceive and interact with fine-grained functional object parts without heavy computational costs.

Abstract

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components by leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage. Supplementary materials and code: t-funs3d.github.io.

Index terms

Semantic Scene Understanding; Perception for Grasping and Manipulation; Object Detection, Segmentation and Categorization