SII 2026

Improving Robotic Imitation Learning with Predicted Facial Motion Using Transformers

Yitong Li, Fumio Kanehiro

Abstract

This study proposes a Transformer-based approach with cross-attention for predicting human facial movements in face-related robotic control tasks and for integrating these predictions into an imitation learning framework. A dataset of human facial videos was constructed, and facial landmarks were extracted using the MediaPipe framework. Three prediction methods were compared; the cross-attention model achieved the best performance in both landmark localization accuracy and image quality. In the imitation learning experiments, which used facial motion trajectories sampled from real human data, the success rate increased from 42% to 60% and ultimately to 74% when predicted landmarks were incorporated. Varying the prediction horizon also affected task completion time, with the 2-frame horizon yielding the fastest completion. These results demonstrate that incorporating predicted facial motion can significantly enhance robotic control performance in dynamic human-robot interaction scenarios.
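
For readers unfamiliar with cross-attention, the following is a minimal sketch of generic single-head scaled dot-product cross-attention applied to landmark frames. It is not the authors' model: the abstract does not specify the architecture, so the function names, the toy landmark values, and the choice of queries are all illustrative assumptions, and the learned projections of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention without learned weights.

    queries: list of query vectors (one per future frame to predict)
    keys, values: lists of vectors (one per observed past frame)
    Returns one output vector per query, each a weighted average of values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every past frame, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical data: 3 past frames, each with 2 landmarks flattened as (x, y, x, y).
past_frames = [
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.21, 0.32, 0.41],
    [0.14, 0.22, 0.34, 0.42],
]
# Assumption: two query tokens, one per frame of a 2-frame prediction horizon.
# Here they are simply copies of the latest frame; a trained model would learn them.
horizon_queries = [past_frames[-1], past_frames[-1]]

predicted = cross_attention(horizon_queries, past_frames, past_frames)
```

Because the sketch has no trained parameters, each output is only a convex combination of the observed frames. A trained model would add learned query, key, and value projections and an output head so that it can extrapolate beyond the observed motion.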

Index terms

Assistive Robotics; Human-Robot Interaction / Collaboration; Machine Learning