EdgeGrasp: Enhancing Edge Perception for 7-DoF Grasping Pose Estimation in Cluttered Scenes
Junning Qiu, Fei Wang, Yu Guo, Yonggen Ling, Minglei Lu
AI summary
Problem
Single-view RGBD sensors yield incomplete point clouds at object edges, making grasp pose estimation difficult. Naively combining RGB and point cloud features also fails to generalize to novel objects.
Approach
EdgeGrasp splits edge perception by modality: voxel-based sparse 3D convolutions on the point cloud capture internal edge geometry, while a vision foundation model on the RGB image captures external edge semantics. An edge spatial attention mechanism that encodes edge distance then fuses the two.
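The fusion step can be illustrated with a minimal sketch of attention biased by edge distance. This is not the authors' implementation: the function name `edge_distance_attention`, the linear distance penalty, the temperature `tau`, and the residual connection are all assumptions chosen only to show how a distance encoding can steer attention between geometric and semantic features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def edge_distance_attention(geo_feats, sem_feats, edge_dist, tau=0.05):
    """Fuse geometric and semantic features with a distance-biased attention.

    geo_feats: (N, D) per-point geometric features (queries).
    sem_feats: (M, D) per-edge-sample semantic features (keys and values).
    edge_dist: (N, M) distance from each point to each edge sample, in meters.
    tau:       hypothetical length scale controlling how fast attention decays.
    """
    d = geo_feats.shape[1]
    # Standard scaled dot-product similarity between queries and keys.
    scores = geo_feats @ sem_feats.T / np.sqrt(d)
    # Assumed distance encoding: penalize edge samples that are far away.
    scores = scores - edge_dist / tau
    weights = softmax(scores, axis=1)
    # Residual fusion: geometric features plus attended semantic features.
    fused = geo_feats + weights @ sem_feats
    return fused, weights


rng = np.random.default_rng(0)
geo = rng.normal(size=(4, 8))
sem = rng.normal(size=(3, 8))
dist = rng.uniform(0.0, 0.1, size=(4, 3))
fused, weights = edge_distance_attention(geo, sem, dist)
```

With identical feature similarity, an edge sample closer to the query point receives a larger attention weight, which is the behavior a distance encoding is meant to produce.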
Key results
- State-of-the-art performance on the GraspNet-1Billion benchmark
- +6.17 AP gain over prior SOTA across seen, similar, and novel test sets
- Validated generalization through ablation studies on feature fusion strategies
- Demonstrated practical applicability in real-world robotic grasping experiments
Why it matters
Advances robust robotic manipulation in unstructured environments by enabling accurate grasp estimation for novel objects without relying on complete 3D scans.
Abstract
Estimating 7-DoF grasping poses (6-DoF plus gripper width) in cluttered scenes is a critical challenge for robotic manipulation. In such environments, object edges often contain many promising grasp candidates, but inferring them from an incomplete single-view point cloud alone is difficult. While neural networks excel at learning edge features from RGB images, simply combining these features with point clouds often fails to generalize to novel scenes. To address these challenges, we propose EdgeGrasp, which enhances edge perception by letting each modality contribute the edge information it is best suited to provide. Internal edge features are extracted through voxel-based sparse 3D convolution on the point cloud aggregated from the edge interior, ensuring a rich geometric representation while mitigating incompleteness at the edge. For external edges and junctions, a vision foundation model is employed to extract local zero-shot semantic features, capturing fine-grained details and improving cross-object generalization. Finally, edge spatial attention encodes edge distance to fuse these features into edge-enhanced features for estimating 7-DoF grasping poses. Experimental results demonstrate our method’s effectiveness, achieving state-of-the-art performance on the GraspNet-1Billion benchmark. Real-world robotic experiments further validate its practical applicability.
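To make the 7-DoF parameterization concrete, the sketch below stores a grasp as a 6-DoF pose (rotation and translation) plus a gripper width. The class name `Grasp7DoF`, its field names, and the convention that the gripper closes along its local y axis are illustrative assumptions, not details taken from the paper or from the GraspNet-1Billion tooling.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Grasp7DoF:
    """Hypothetical container: 3 DoF rotation + 3 DoF translation + width."""

    rotation: np.ndarray     # (3, 3) rotation matrix of the gripper frame
    translation: np.ndarray  # (3,) grasp center in the camera frame, meters
    width: float             # gripper opening width, meters

    def to_matrix(self) -> np.ndarray:
        # Pack the 6-DoF part into a 4x4 homogeneous transform.
        T = np.eye(4)
        T[:3, :3] = self.rotation
        T[:3, 3] = self.translation
        return T

    def fingertip_positions(self):
        # Assumed convention: fingers open symmetrically along local y.
        half = 0.5 * self.width * self.rotation[:, 1]
        return self.translation - half, self.translation + half


grasp = Grasp7DoF(
    rotation=np.eye(3),
    translation=np.array([0.1, 0.0, 0.5]),
    width=0.08,
)
left, right = grasp.fingertip_positions()
```

The width is the seventh degree of freedom: the 4x4 transform fixes where and how the gripper is oriented, and the width fixes how far the fingers are opened around the grasp center.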