Joint Segmentation and Grasp Pose Detection with Multi-Modal Feature Fusion Network
Xiaozheng Liu, Yunzhou Zhang, He Cao, Shan Dexing, Jiaqi Zhao
Abstract
Efficient grasp pose detection is essential for robotic manipulation in cluttered scenes. However, most methods use only point clouds or only images for prediction, ignoring the complementary strengths of the two modalities. In this paper, we present a multi-modal fusion network for joint segmentation and grasp pose detection. We design a point cloud and image co-guided feature fusion module that fuses features and adaptively estimates the importance of each point-pixel feature pair. Moreover, we develop a seed point sampling algorithm that simultaneously considers distance, semantics, and attention scores. For the selected seed points, we adopt a local feature aggregation module to fully exploit the local spatial features in the grasp region. Experimental results on the GraspNet-1Billion dataset show that our network outperforms several state-of-the-art methods. We also conduct real robot grasping experiments to demonstrate the effectiveness of our approach.
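To make the seed point sampling idea concrete, the following is a minimal sketch of one way distance, semantics, and attention could be combined: a greedy farthest-point-style selection in which each candidate's distance to the nearest chosen seed is weighted by a per-point quality term. The abstract does not specify the combination rule, so the function name `sample_seed_points`, the product `fg_prob * attention`, and the exponent `alpha` are all assumptions made for illustration, not the network's actual algorithm.

```python
import math


def sample_seed_points(points, fg_prob, attention, k, alpha=1.0):
    """Greedy, score-weighted farthest point sampling (illustrative only).

    points    : list of (x, y, z) tuples
    fg_prob   : per-point foreground (semantic) probability in [0, 1]
    attention : per-point attention score, assumed non-negative
    k         : number of seed points to return
    alpha     : hypothetical exponent controlling how strongly quality counts
    """
    n = len(points)
    if not (len(fg_prob) == n and len(attention) == n):
        raise ValueError("points, fg_prob and attention must have equal length")
    k = min(k, n)
    if k <= 0:
        return []

    # Assumed quality term: product of semantic and attention scores.
    quality = [(fg_prob[i] * attention[i]) ** alpha for i in range(n)]

    # Start from the highest-quality point.
    first = max(range(n), key=lambda i: quality[i])
    selected = [first]
    chosen = {first}

    # min_dist[i] is the distance from point i to its nearest selected seed.
    min_dist = [math.dist(points[i], points[first]) for i in range(n)]

    while len(selected) < k:
        remaining = [i for i in range(n) if i not in chosen]
        # Trade off spatial coverage (min_dist) against quality.
        nxt = max(remaining, key=lambda i: min_dist[i] * quality[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i in range(n):
            d = math.dist(points[i], points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy data: two nearby pairs of foreground points and one distant background point.
points = [(0, 0, 0), (0.1, 0, 0), (5, 0, 0), (5.1, 0, 0), (10, 0, 0)]
fg_prob = [0.9, 0.9, 0.9, 0.9, 0.0]
attention = [1.0, 0.5, 0.8, 0.2, 1.0]
print(sample_seed_points(points, fg_prob, attention, k=2))  # [0, 2]
```

In this toy case the distant background point (index 4) is never preferred despite being farthest away, because its semantic score is zero; plain farthest point sampling would have picked it second.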