← Back IROS 2024

Hierarchical Action Chunking Transformer: Learning Temporal Multimodality from Demonstrations with Fast Imitation Behavior

J. hyeon Park, Wonhyuk Choi, Sunpyo Hong, Hoseong Seo, Joonmo Ahn, Changsu Ha, Heungwoo Han, Junghyun Kwon

PDF

Abstract

Behavioral cloning from human demonstrations has succeeded in programming a robot to generate fine-grained motion, but it is still challenging to learn multimodal trajecto- ries such as with various speeds. This restricts the use of a robot dataset collected by multiusers because the different proficiency of robot operators makes the dataset have diverse distributions of speed. To tackle this issue, we develop Hierarchical Action Chunking Transformer with Vector-quantization (HACT-Vq) to efficiently learn temporal multimodality in addition to fine- grained motion. The proposed hierarchical model consists of a high-level policy to make planning for a latent subgoal and style, and a low-level policy to predict an action chunk conditioned with the latent subgoal and style. The latent subgoal and style are trained as discrete representations so that high-level policy can efficiently learn multimodal distributions of demonstrations and retrieve the mode of fast behavior. In experiments, we set up bimanual robots in both simulation and real-world environ- ments, and collected demonstrations with various speeds. The proposed model with the quantized subgoal and style showed the highest success rates with fast imitation behavior. Our code is available at https://github.com/SamsungLabs/hierarchical-act.

Index terms

Bimanual Manipulation Learning from Demonstration Deep Learning in Grasping and Manipulation