GazeMoE: Perception of Gaze Target with Mixture-Of-Experts
Zhuangzhuang Dai, Zhongxi Lu, Vincent Gbouna Zakka, Luis J. Manso, Jose Maria Alcaraz Calero, Chen Li
AI summary
Problem
Estimating human gaze targets from visible images is critical for robotics and human-computer interaction, but existing methods struggle with generalization, class imbalance between in-frame and out-of-frame targets, and adaptively integrating diverse, sometimes missing, visual cues.
Approach
GazeMoE combines a frozen DINOv2 vision encoder with a Mixture-of-Experts transformer decoder that selectively routes gaze-relevant features. It also introduces a class-balancing focal loss and comprehensive geometric and photometric augmentations to stabilize training and boost generalization.
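The selective routing described above can be illustrated with a minimal top-k gating sketch in plain Python. It is not the authors' implementation: the summary does not specify the number of experts, the gating function, or the value of k, so `moe_forward`, `gate_W`, `top_k`, and the use of simple linear experts are all hypothetical choices made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_forward(token, gate_W, experts, top_k=2):
    """Route one feature vector through its top_k highest-scoring experts.

    token   : feature vector (e.g. one token from a frozen encoder)
    gate_W  : gating matrix, one row per expert (hypothetical linear gate)
    experts : list of square matrices, each standing in for one expert network
    """
    # Score every expert for this token.
    probs = softmax(matvec(gate_W, token))
    # Keep only the top_k experts; the rest are never evaluated.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the selected experts.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = matvec(experts[i], token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen


# Illustrative usage with made-up weights (assumption: 3 experts, 2-d features).
token = [1.0, 2.0]
gate_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
mixed, picked = moe_forward(token, gate_W, [identity, double, identity], top_k=2)
```

Because only the selected experts are evaluated, compute per token grows with k rather than with the total number of experts, which is one common motivation for sparse MoE layers.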
Key results
- State-of-the-art performance on GazeFollow, VideoAttentionTarget, ChildPlay, and GazeFollow360 benchmarks
- Superior robustness to out-of-distribution data, including fisheye imagery and child gaze datasets
- Effective resolution of in-frame vs. out-of-frame class imbalance via custom focal loss
- High accuracy with only 3.4M learnable parameters, maintaining computational efficiency
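As a rough illustration of how a focal loss can counter the in-frame vs. out-of-frame imbalance listed above, the sketch below implements the standard binary focal loss. The summary does not give the exact formulation used in GazeMoE, so the function name `binary_focal_loss` and the default values of `alpha` and `gamma` are assumptions chosen for illustration, not the paper's settings.

```python
import math


def binary_focal_loss(prob_in_frame, is_in_frame, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for a single prediction.

    prob_in_frame : predicted probability that the gaze target is in frame
    is_in_frame   : ground-truth label (True = in frame, False = out of frame)
    alpha         : weight for the in-frame class (hypothetical default)
    gamma         : focusing exponent; 0 reduces to weighted cross-entropy
    """
    eps = 1e-7
    # Clamp to avoid log(0).
    p = min(max(prob_in_frame, eps), 1.0 - eps)
    # p_t is the probability assigned to the true class.
    p_t = p if is_in_frame else 1.0 - p
    alpha_t = alpha if is_in_frame else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative usage: a confident correct prediction vs. a confident wrong one.
easy = binary_focal_loss(0.9, True)
hard = binary_focal_loss(0.1, True)
```

Here `alpha` rebalances the two classes while `gamma` shifts the training signal toward hard, misclassified examples; both would need tuning against the actual class frequencies in a given dataset.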
Why it matters
Enables reliable gaze-following for real-world robotics, autonomous systems, and human-computer interaction across varying visual conditions and demographics.
Abstract
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues, including eyes, head poses, gestures, and contextual features, demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at: https://huggingface.co/zdai257/GazeMoE.