Modality Attention for Prediction-Based Robot Motion Generation: Improving Interpretability and Robustness of Using Multi-Modality
Hideyuki Ichiwara, Hiroshi Ito, Kenjiro Yamamoto, Hiroki Mori, Tetsuya Ogata
Abstract
We developed a modality attention motion generation model based on multi-modality prediction. The model provides interpretability regarding modality usage and is robust against disturbances. It is a hierarchical model consisting of low-level recurrent neural networks (RNNs) that process each modality individually and a high-level RNN that integrates the modalities. The integration is achieved by efficiently gating the modalities and feeding the gated result to the high-level RNN. We verified interpretability and robustness on a furniture-part insertion task, which consists of an “approach” phase, in which a wooden dowel is brought close to a hole, and an “insertion” phase. The proposed model achieves the same task success rate as the conventional model while revealing that it refers to vision during “approach” and to force during “insertion,” which provides interpretability regarding modality use. Furthermore, whereas the task success rate of the model without modality attention drops significantly under disturbance, the proposed model is robust against disturbances to modalities it does not attend to during the task, maintaining a consistently high success rate (≃90%).
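The hierarchical gating described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `ModalityAttentionSketch`, the plain tanh RNN cells, the layer sizes, the random untrained weights, and the choice to compute softmax attention weights from the high-level hidden state are all assumptions made for illustration, since the abstract states only that each modality is processed by a low-level RNN and that the modalities are gated before entering the high-level RNN.

```python
import numpy as np


def rnn_step(x, h, W_x, W_h):
    """One step of a plain tanh RNN cell (a stand-in; the cell type is an assumption)."""
    return np.tanh(W_x @ x + W_h @ h)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


class ModalityAttentionSketch:
    """Hypothetical two-level model: one low-level RNN per modality, one high-level RNN."""

    def __init__(self, input_dims, low_dim=8, high_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.names = list(input_dims)

        def init(*shape):
            # Small random weights; a real model would learn these.
            return rng.normal(scale=0.1, size=shape)

        # Low-level RNN parameters (input and recurrent weights) per modality.
        self.low = {m: (init(low_dim, d), init(low_dim, low_dim))
                    for m, d in input_dims.items()}
        # Attention logits are computed from the high-level state (assumption).
        self.W_att = init(len(self.names), high_dim)
        # High-level RNN consumes the concatenated, gated low-level states.
        self.W_high = (init(high_dim, len(self.names) * low_dim),
                       init(high_dim, high_dim))
        self.h_low = {m: np.zeros(low_dim) for m in self.names}
        self.h_high = np.zeros(high_dim)

    def step(self, inputs):
        # 1. Each modality is processed individually by its own low-level RNN.
        for m in self.names:
            W_x, W_h = self.low[m]
            self.h_low[m] = rnn_step(inputs[m], self.h_low[m], W_x, W_h)
        # 2. One attention weight per modality; the weights sum to 1.
        weights = softmax(self.W_att @ self.h_high)
        # 3. Gate each low-level state by its weight and concatenate.
        gated = np.concatenate(
            [w * self.h_low[m] for w, m in zip(weights, self.names)])
        # 4. The high-level RNN integrates the gated modalities.
        self.h_high = rnn_step(gated, self.h_high, *self.W_high)
        return dict(zip(self.names, weights)), self.h_high


model = ModalityAttentionSketch({"vision": 10, "force": 6})
weights, h_high = model.step({"vision": np.ones(10), "force": np.ones(6)})
print(weights)
```

In a sketch of this form, the per-step weights are what would be inspected for interpretability, and a modality whose weight is close to zero contributes almost nothing to the high-level input, which is one plausible reading of why disturbances to an unattended modality would have little effect.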