ICRA 2026

EdgeGrasp: Enhancing Edge Perception for 7-DoF Grasping Pose Estimation in Cluttered Scenes

Junning Qiu, Fei Wang, Yu Guo, Yonggen Ling, Minglei Lu

AI summary

EdgeGrasp improves 7-DoF grasping in cluttered scenes by strategically fusing internal geometric and external semantic edge features via spatial attention.

7-DoF grasping, edge perception, RGBD fusion, spatial attention, robotic manipulation, GraspNet-1Billion

Problem

Single-view RGBD sensors yield incomplete point clouds at object edges, making grasp pose estimation difficult. Naively combining RGB and point cloud features also often fails to generalize to novel scenes.

Approach

EdgeGrasp assigns internal edge geometry to voxel-based 3D convolutions on point clouds and external edge semantics to a foundation vision model on RGB images, then fuses them using an edge spatial encoding attention mechanism.
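
To make the fusion idea concrete, here is a minimal NumPy sketch of attention whose logits are penalized by a distance term. It is not the paper's architecture: the function name `edge_distance_attention`, the additive `-alpha * edge_dist` bias, the residual connection, and all shapes are assumptions chosen only to illustrate how a spatial distance can steer which semantic features a geometric feature attends to.

```python
import numpy as np


def edge_distance_attention(geo_feats, sem_feats, edge_dist, alpha=1.0):
    """Distance-biased cross-attention (illustrative, hypothetical design).

    geo_feats: (N, d) geometric features, used as queries.
    sem_feats: (M, d) semantic features, used as keys and values.
    edge_dist: (N, M) non-negative distances; larger means less relevant.
    alpha:     assumed weight of the distance penalty on the logits.
    """
    d = geo_feats.shape[1]
    # Scaled dot-product similarity minus a distance penalty.
    logits = geo_feats @ sem_feats.T / np.sqrt(d) - alpha * edge_dist
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Residual fusion: keep the geometry, add attended semantics.
    fused = geo_feats + weights @ sem_feats
    return fused, weights


rng = np.random.default_rng(0)
geo = rng.normal(size=(4, 8))    # 4 geometric feature vectors
sem = rng.normal(size=(6, 8))    # 6 semantic feature vectors
dist = rng.uniform(0.0, 1.0, size=(4, 6))
fused, weights = edge_distance_attention(geo, sem, dist)
```

Each row of `weights` sums to one, and raising `alpha` concentrates attention on the semantic features with the smallest distance.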

Key results

State-of-the-art performance on the GraspNet-1Billion benchmark
+6.17 AP gain over prior SOTA across seen, similar, and novel test sets
Validated generalization through ablation studies on feature fusion strategies
Demonstrated practical applicability in real-world robotic grasping experiments

Why it matters

Advances robust robotic manipulation in unstructured environments by enabling accurate grasp estimation for novel objects without relying on complete 3D scans.

Abstract

Estimating 7-DoF grasping poses (6-DoF with gripper width) in cluttered scenes is a critical challenge for robotic manipulation. In such environments, object edges often contain many promising grasp candidates, but inferring them solely from an incomplete single-view point cloud is difficult. While neural networks excel at learning edge features from RGB images, simply combining these with point clouds often fails to generalize to novel scenes. To address these challenges, we propose EdgeGrasp, which enhances edge perception by letting each modality contribute the edge information it is best suited to provide, improving grasping performance. The internal edge features are extracted through voxel-based sparse 3D convolution on the aggregated point cloud from the edge interior, ensuring a rich geometric representation while mitigating incompleteness at the edge. For external edges and junctions, a vision foundation model is employed to extract local zero-shot semantic features, capturing fine-grained details and improving cross-object generalization. Finally, edge spatial attention encodes edge distance to fuse these features into edge-enhanced features for estimating 7-DoF grasping poses. Experimental results demonstrate our method's effectiveness, achieving state-of-the-art performance on the GraspNet-1Billion benchmark. Real-world robotic experiments further validate its practical applicability.
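
As a small illustration of the "6-DoF with gripper width" parameterization mentioned above, the sketch below stores a grasp as a rotation (3 DoF), a translation (3 DoF), and an opening width (1 DoF). The class name `Grasp7DoF`, its field names, and the use of metres are assumptions for illustration, not the paper's or the benchmark's data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Grasp7DoF:
    """Hypothetical container for a 7-DoF parallel-jaw grasp."""

    rotation: np.ndarray     # (3, 3) gripper orientation, 3 DoF
    translation: np.ndarray  # (3,) gripper position, 3 DoF
    width: float             # gripper opening (assumed metres), 1 DoF

    def __post_init__(self):
        self.rotation = np.asarray(self.rotation, dtype=float)
        self.translation = np.asarray(self.translation, dtype=float)
        if self.rotation.shape != (3, 3) or self.translation.shape != (3,):
            raise ValueError("rotation must be 3x3 and translation length 3")
        # A valid rotation is orthonormal with determinant +1.
        orthonormal = np.allclose(
            self.rotation @ self.rotation.T, np.eye(3), atol=1e-6
        )
        if not orthonormal or not np.isclose(np.linalg.det(self.rotation), 1.0):
            raise ValueError("rotation is not a proper rotation matrix")
        if self.width < 0:
            raise ValueError("width must be non-negative")

    def as_matrix(self):
        """Return the 6-DoF part as a 4x4 homogeneous transform."""
        transform = np.eye(4)
        transform[:3, :3] = self.rotation
        transform[:3, 3] = self.translation
        return transform


grasp = Grasp7DoF(np.eye(3), [0.1, 0.0, 0.3], 0.05)
pose = grasp.as_matrix()
```

The width is kept outside the homogeneous transform because it describes the gripper's opening rather than its placement in space.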

Index terms

Deep Learning Methods; Perception for Grasping and Manipulation; Deep Learning in Grasping and Manipulation