ICRA 2026

Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim

AI summary

Point2Act distills multi-view MLLM predictions into a lightweight 3D relevancy field, enabling fast, accurate, zero-shot context-aware robotic grasping that outperforms existing methods.

Context-aware grasping; Multimodal LLMs; 3D relevancy fields; Zero-shot manipulation; Multi-view aggregation; Robotic reasoning

Problem

Existing foundation model approaches for 3D scene understanding are computationally expensive, struggle with complex compositional language queries, and produce diffuse, viewpoint-dependent activations that fail at precise spatial localization.

Approach

The method prompts an MLLM to directly predict 2D action points from multi-view images, then aggregates these predictions into a single-channel 3D relevancy field that efficiently guides grasp pose extraction.
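The aggregation step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes pinhole cameras with known intrinsics `K` and world-to-camera poses `T_cw`, one MLLM-predicted 2D point per view, and a hypothetical Gaussian vote with width `sigma_px`. It also ignores occlusion, which a real system would need to handle. The names `project` and `relevancy_field` are illustrative only.

```python
import numpy as np

def project(points_w, K, T_cw):
    """Project Nx3 world points into a pinhole camera.

    K is the 3x3 intrinsics matrix, T_cw the 4x4 world-to-camera transform.
    Returns Nx2 pixel coordinates and the N camera-frame depths.
    """
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_cw @ pts_h.T).T[:, :3]
    depth = pts_c[:, 2]
    safe = np.where(depth > 1e-6, depth, 1.0)  # avoid dividing by zero
    uv = (K @ pts_c.T).T[:, :2] / safe[:, None]
    return uv, depth

def relevancy_field(points_w, views, sigma_px=15.0):
    """Single-channel relevancy per 3D point (assumed Gaussian voting).

    Each view votes with a Gaussian of the pixel distance between the
    projected 3D point and that view's predicted 2D action point.
    Votes are averaged over views.
    """
    score = np.zeros(len(points_w))
    for K, T_cw, point_2d in views:
        uv, depth = project(points_w, K, T_cw)
        d2 = np.sum((uv - np.asarray(point_2d, dtype=float)) ** 2, axis=1)
        vote = np.exp(-d2 / (2.0 * sigma_px ** 2))
        vote[depth <= 1e-6] = 0.0  # points behind the camera cast no vote
        score += vote
    return score / max(len(views), 1)

# Toy scene: candidates 0 and 1 lie on the same ray of the first camera,
# so that view alone cannot tell them apart.
points = np.array([[0.0, 0.0, 2.0],   # true action point
                   [0.0, 0.0, 3.0],   # same pixel in view 1, wrong depth
                   [0.3, 0.0, 2.0]])  # nearby distractor

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T1 = np.eye(4)                         # camera at the world origin
T2 = np.eye(4)
T2[0, 3] = -0.5                        # camera shifted 0.5 m along +x

# Hypothetical 2D predictions, one per view, both pointing at candidate 0.
views = [(K, T1, (320.0, 240.0)),
         (K, T2, (195.0, 240.0))]

scores = relevancy_field(points, views)
best = int(np.argmax(scores))
print(scores, best)
```

In this toy scene the first view gives candidates 0 and 1 identical votes, and only the second view separates them. That is the kind of geometric ambiguity multi-view aggregation is meant to resolve.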

Key results

98% object/part identification and 73% successful lift rate in zero-shot grasping
16.5-second full-stack pipeline latency through pipelined execution
Robust localization under occlusion and viewpoint changes via multi-view aggregation
Accurate handling of complex compositional and abstract language instructions

Why it matters

Provides a practical framework for generalist robots to interpret nuanced human instructions and execute precise physical actions without task-specific fine-tuning, with a full-stack latency of 16.5 seconds.

Abstract

We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics from large-scale image and language datasets provide contextual understanding in 2D images, existing methods that leverage foundation models for 3D reconstruction struggle to accurately interpret complex compositional queries and require extensive computation. Our proposed 3D relevancy fields bypass the high-dimensional features, instead efficiently imbuing lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments caused by geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, leveraging fine-grained 3D spatial context to directly identify an explicit position for a physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in 16.5 seconds, facilitating practical manipulation tasks.

Index terms

Perception for Grasping and Manipulation; Deep Learning in Grasping and Manipulation; Deep Learning for Visual Perception