HAGrasp: Hybrid Action Grasp Control in Cluttered Scenes Using Deep Reinforcement Learning
Kai-Tai Song, Hsiang-Hsi Chen
Abstract
Autonomous robotic grasping requires the system to perform multiple functions, such as gripper and robot control, which makes it a task with hybrid outputs. Existing methods based on closed-loop deep reinforcement learning rely on external models for termination evaluation. To grasp novel objects more effectively, we propose a new autonomous grasp control scheme, termed HAGrasp, that considers the complete point cloud of the workspace. It integrates grasp pose estimation, end-effector pose evaluation, and motion planning of the robotic arm into a single model, raising the success rate while reducing computational load. We present a closed-loop grasp control system based on deep reinforcement learning. This control system can perform grasp tasks while dynamically adjusting to avoid end-effector collisions. The hybrid-action reinforcement learning module is trained with a unified latent action space, which further improves generalization and achieves real-time autonomous grasp control. Real-robot experiments show that our method achieves a 74.2% success rate in grasping 7 unseen objects. Comparative experiments show that the proposed HAGrasp outperforms the open-loop baseline Contact-GraspNet in both success rate and inference time. The results demonstrate that, with integrated multi-view input and a sim-to-real training design, our method improves real-world applications of autonomous grasping.
INTRODUCTION
Autonomous robotic grasping requires the system to perform both gripper and robot control, which makes it a task with hybrid outputs [1]-[3]. For grasp tasks in cluttered environments, the robot needs to perform obstacle avoidance and grasp estimation simultaneously. Open-loop approaches [4]-[8] focus primarily on grasp estimation and treat motion planning as a minor problem. Most existing closed-loop methods use deep reinforcement learning (DRL) to handle low-level control but rely on separate models for termination evaluation [9], [10].
Grasp determination plays an important role in robotic grasping, where an efficient control system can notably reduce computational load and cycle time. Open-loop methods combined with deep learning are constrained by a performance bottleneck arising from single-view inputs [11]. This limitation becomes evident when objects are occluded due to their arbitrary orientations, which degrades system performance. Furthermore, the absence of an additional motion planner can cause grasp failures through collisions between the robot and its surroundings. Typical closed-loop designs rely on policy learning to output robot actions directly, but the termination condition is determined by an external model [9], [10], [12]. Because termination evaluation is decoupled from training, the rewards obtained from interacting with the environment are not directly related to the grasp task. As a result, the model struggles to capture the multimodal nature of grasp poses. To improve system performance and efficiency, it is better practice to combine termination evaluation and end-effector control in a hybrid action control system.
In this work, we present the Hybrid Action Grasp control system (HAGrasp), a system that performs collision avoidance and grasping simultaneously by leveraging the geometric features of point clouds. The function of autonomous robotic grasping is divided into two parts: waypoint prediction and termination evaluation. To integrate multiple outputs into a single model, termination evaluation is treated as a learnable parameter. As shown in Fig. 1, in the hybrid-output control approach, end-effector control is a continuous action in the 6-DoF Cartesian coordinate space, while termination is a binary discrete action that controls the parallel gripper. Further, robotic tasks with high dimensionality and sparse rewards make RL training more difficult.
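The hybrid action described above can be illustrated with a minimal sketch. It is not the HAGrasp implementation: the names `HybridAction`, `GripperState`, and `step` are hypothetical, the orientation is represented as Euler angles that are updated by simple addition purely for illustration, and the only behavior taken from the text is that each action pairs a continuous 6-DoF end-effector command with a binary termination decision that triggers the parallel gripper.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed pose layout for illustration: (x, y, z, roll, pitch, yaw).
Pose6 = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class HybridAction:
    """One policy output: a continuous part plus a discrete part."""
    delta_pose: Pose6  # continuous 6-DoF end-effector displacement
    terminate: bool    # binary decision: close the gripper now?


@dataclass(frozen=True)
class GripperState:
    pose: Pose6
    closed: bool = False


def step(state: GripperState, action: HybridAction) -> GripperState:
    """Apply one hybrid action (hypothetical helper).

    If the discrete part fires, the gripper closes at the current pose and
    the continuous part is ignored; otherwise the end-effector moves by the
    continuous displacement and the episode continues.
    """
    if state.closed:
        return state  # episode already terminated
    if action.terminate:
        return GripperState(pose=state.pose, closed=True)
    # Naive component-wise update; a real controller would compose rotations.
    new_pose = tuple(p + d for p, d in zip(state.pose, action.delta_pose))
    return GripperState(pose=new_pose, closed=False)
```

In this sketch a closed-loop episode is a sequence of `step` calls in which the policy keeps emitting waypoints until it decides to terminate, so no external model is needed to decide when to grasp.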
Moreover, a hybrid-action DRL design without further optimization suffers a larger performance drop in the real world due to the sim-to-real gap [2]. To cope with this problem, latent action spaces have been introduced to enable faster learning with an efficient action representation [10], [13], [14]. Our approach addresses this action-space problem by using latent actions from a Conditional Variational Autoencoder (CVAE) [15]. The proposed method incorporates both outputs of the discrete-continuous action space [16] into the optimization process of policy learning. An encoder maps continuous and discrete actions into a unified representation space for optimization. By incorporating termination evaluation into the Q-learning design, the model is optimized for end-effector control while considering the evaluation and quality of target grasp poses. This approach allows the model to learn the multi-modality and features of successful grasps.
Fig. 1. Illustration of the HAGrasp method. The policy takes multi-view inputs and outputs a hybrid action for gripper and end-effector control.
* Research supported by National Science and Technology Council of Taiwan, R.O.C., under grant MOST 112-2221-E-A49-112-.
Kai-Tai Song* is with the Institute of Electrical and Control Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan. (Corresponding author: 886-3-5731865, e-mail: ktsong@nycu.edu.tw)
Hsiang-Hsi Chen is with the Graduate Degree Program of Robotics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (e-mail: a310605024.en10@nycu.edu.tw)
2024 IEEE International Conference on Robotics and Automation (ICRA 2024), May 13-17, 2024, Yokohama, Japan. 979-8-3503-8457-4/24/$31.00 ©2024 IEEE