T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation
Jingkun Feng, Reza Sabzevari
AI summary
Problem
Current open-vocabulary 3D segmentation methods either focus on object-level recognition or exhaustively segment entire scenes, making them too resource-intensive for robotic applications that require fine-grained, task-specific functional parts.
Approach
The method builds a lightweight open-vocabulary scene graph from 3D point clouds and RGB-D images, then uses an LLM to parse task descriptions and a vision-language model to hierarchically segment only the relevant functional components of identified objects.
Key results
- Training-free hierarchical segmentation pipeline
- Open-vocabulary scene graph with visual embedding nodes and edges
- Task-driven localization of contextual objects and functional parts
- Competitive accuracy on SceneFun3D with reduced runtime and memory
Why it matters
Enables real-world robotic systems to efficiently perceive and interact with fine-grained functional object parts without heavy computational costs.
Abstract
Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components by leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D performs comparably to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
Supplementary materials and code: t-funs3d.github.io.
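To make the retrieval step in the abstract concrete, the following is a minimal sketch of how a task query might be matched against instance embeddings in a scene graph. It is not the authors' implementation: the names `InstanceNode`, `SceneGraph`, and `top_k_instances`, the use of cosine similarity, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from a vision-language encoder, and the abstract does not specify the similarity measure.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class InstanceNode:
    """Hypothetical scene-graph node: one object instance and its visual embedding."""
    node_id: int
    embedding: list[float]


@dataclass
class SceneGraph:
    """Hypothetical container: nodes keyed by id, undirected edges as id pairs."""
    nodes: dict[int, InstanceNode] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)

    def add_node(self, node: InstanceNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, a: int, b: int) -> None:
        # Store each undirected edge once, with the smaller id first.
        self.edges.add((min(a, b), max(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_instances(graph: SceneGraph, query: list[float], k: int = 1) -> list[int]:
    """Return ids of the k nodes whose embeddings are most similar to the query."""
    ranked = sorted(
        graph.nodes.values(),
        key=lambda n: cosine_similarity(n.embedding, query),
        reverse=True,
    )
    return [n.node_id for n in ranked[:k]]


# Toy usage: three instances with made-up embeddings and a made-up query vector.
graph = SceneGraph()
graph.add_node(InstanceNode(0, [1.0, 0.0, 0.0]))
graph.add_node(InstanceNode(1, [0.0, 1.0, 0.0]))
graph.add_node(InstanceNode(2, [0.9, 0.1, 0.0]))
graph.add_edge(2, 0)

relevant = top_k_instances(graph, [1.0, 0.0, 0.0], k=2)
```

Only the instances returned by such a query would then be passed on for functional-part segmentation, which is the source of the runtime and memory savings the abstract claims over exhaustive scene-wide segmentation.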