Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping
Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim
AI summary
Problem
Existing foundation model approaches for 3D scene understanding are computationally expensive, struggle with complex compositional language queries, and produce diffuse, viewpoint-dependent activations that fail at precise spatial localization.
Approach
The method prompts an MLLM to directly predict 2D action points from multi-view images, then aggregates these predictions into a single-channel 3D relevancy field that efficiently guides grasp pose extraction.
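As a rough illustration of the aggregation step, the sketch below scores a set of candidate 3D points by how close each one projects to the 2D point predicted in every view, then averages the scores into a single scalar per 3D point. This is not the authors' implementation. The pinhole camera model, the Gaussian falloff, the `sigma_px` parameter, and the function names `project` and `relevancy_field` are all assumptions made for illustration, and occlusion is ignored.

```python
import numpy as np

def project(points, K, T_cw):
    """Project Nx3 world points with a pinhole model; returns pixels (Nx2) and depths (N,)."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cw @ pts_h.T).T[:, :3]          # world -> camera coordinates
    z = cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)      # avoid division by zero behind the camera
    uv = (K @ cam.T).T[:, :2] / safe_z[:, None]
    return uv, z

def relevancy_field(points, views, sigma_px=10.0):
    """Average, over views, a Gaussian score of the pixel distance to each predicted 2D point.

    `views` is a list of (K, T_cw, point_2d) tuples, where point_2d stands in for
    the 2D action point an MLLM returned for that view (hypothetical input format).
    """
    score = np.zeros(len(points))
    for K, T_cw, point_2d in views:
        uv, z = project(points, K, T_cw)
        d2 = np.sum((uv - np.asarray(point_2d, dtype=float)) ** 2, axis=1)
        vote = np.exp(-d2 / (2.0 * sigma_px ** 2))
        vote[z <= 1e-6] = 0.0                # points behind the camera get no vote
        score += vote
    return score / len(views)
```

A single view cannot distinguish points that lie along the same camera ray, so the per-view scores only become localized in 3D once several viewpoints agree. That is the intuition behind aggregating across views in this sketch.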
Key results
- 98% object/part identification and 73% successful lift rate in zero-shot grasping
- 16.5-second full-stack pipeline latency through pipelined execution
- Robust localization under occlusion and viewpoint changes via multi-view aggregation
- Accurate handling of complex compositional and abstract language instructions
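The summary does not say how the stages are overlapped, so the following is only a minimal sketch of one plausible form of pipelining: each view's MLLM query is submitted to a worker pool as soon as that view is captured, so queries run while later views are still being captured. `capture_view`, `query_mllm`, and `run_pipelined` are hypothetical stand-ins, not functions from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def capture_view(i):
    """Hypothetical stand-in for moving the camera and grabbing image i."""
    return f"image_{i}"

def query_mllm(image):
    """Hypothetical stand-in for an MLLM call that returns a 2D action point."""
    return (image, (0, 0))

def run_pipelined(num_views, max_workers=4):
    """Submit each view's query right after capture; collect results in capture order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for i in range(num_views):
            image = capture_view(i)                          # capture on the main thread
            futures.append(pool.submit(query_mllm, image))   # query overlaps the next capture
        return [f.result() for f in futures]
```

With this structure, the total time approaches the capture time plus one query's latency rather than the sum of all captures and all queries, provided the queries are I/O-bound (for example, remote API calls).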
Why it matters
Provides a practical framework for generalist robots to interpret nuanced human instructions and execute precise physical actions without task-specific fine-tuning, with the full pipeline responding in 16.5 seconds.
Abstract
We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics from large-scale image and language datasets provide contextual understanding in 2D images, existing methods that leverage foundation models for 3D reconstruction struggle to accurately interpret complex compositional queries and require extensive computation. Our proposed 3D relevancy fields bypass the high-dimensional features, instead efficiently imbuing lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments caused by geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, leveraging fine-grained 3D spatial context to directly identify an explicit position for a physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in 16.5 seconds, facilitating practical manipulation tasks.