IROS 2024

MLPER: Multi-Level Prompts for Adaptively Enhancing Vision-Language Emotion Recognition

Yu Gao, Weihong Ren, Xinglong Xu, Yan Wang, Zhiyong Wang, Honghai Liu

Abstract

In the field of robotics, vision-based Emotion Recognition (ER) has achieved significant progress, but it still generalizes poorly under unconstrained conditions (e.g., occlusions and pose variations). In this work, we propose the MLPER model, which introduces a Vision-Language Model for Emotion Recognition to learn discriminative representations adaptively. Specifically, instead of relying on a hand-crafted prompt (e.g., “a photo of a [class] person”), we first establish Multi-Level Prompts from three aspects (facial expression, human posture, and situational condition) using large language models such as ChatGPT. Correspondingly, we extract visual tokens from three levels: the face, body, and context. Further, to achieve fine-grained alignment at each level, we adopt textual tokens from the positive and the hard negative to query the visual tokens, predicting whether an image-text pair is matched. Experimental results demonstrate that our MLPER model outperforms state-of-the-art methods on several ER benchmarks, especially under occlusions and pose variations.
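
As a rough illustration of the per-level matching idea described in the abstract, the following minimal Python sketch builds (level, class, label) pairs for image-text matching at the face, body, and context levels. It is not the authors' implementation. The abstract does not say how the hard negative is chosen, so the sketch assumes a common heuristic: the wrong class whose text embedding is most similar to the visual embedding. The function names, the cosine-similarity criterion, and the toy 2-D embeddings are all hypothetical.

```python
import math

# The three levels named in the abstract.
LEVELS = ("face", "body", "context")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_pairs(visual, text_by_class, true_class):
    """Return (positive_class, hard_negative_class) for one level.

    Assumption (not from the paper): the hard negative is the wrong
    class whose text embedding is closest to the visual embedding.
    """
    scores = {
        cls: cosine(visual, emb)
        for cls, emb in text_by_class.items()
        if cls != true_class
    }
    hard_negative = max(scores, key=scores.get)
    return true_class, hard_negative


def build_matching_pairs(visual_by_level, text_by_level, true_class):
    """Build (level, class, label) triples: label 1 = matched, 0 = not."""
    pairs = []
    for level in LEVELS:
        pos, neg = pick_pairs(
            visual_by_level[level], text_by_level[level], true_class
        )
        pairs.append((level, pos, 1))
        pairs.append((level, neg, 0))
    return pairs


# Toy 2-D embeddings standing in for real visual and text encoder outputs.
visual_by_level = {
    "face": [1.0, 0.0],
    "body": [0.0, 1.0],
    "context": [1.0, 1.0],
}
text_by_level = {
    "face": {"happy": [0.9, 0.1], "sad": [0.5, 0.5], "angry": [0.0, 1.0]},
    "body": {"happy": [0.2, 0.8], "sad": [0.1, 0.9], "angry": [1.0, 0.0]},
    "context": {"happy": [0.6, 0.6], "sad": [1.0, 0.0], "angry": [0.7, 0.5]},
}

pairs = build_matching_pairs(visual_by_level, text_by_level, "happy")
```

In a real system, each triple would feed a matching head that predicts the binary label from the queried visual tokens; here the triples only show which texts would be paired with the image at each level.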

Index terms

Gesture, Posture and Facial Expressions; Computer Vision for Medical Robotics; Emotional Robotics