ICRA 2026

GazeMoE: Perception of Gaze Target with Mixture-Of-Experts

Zhuangzhuang Dai, Zhongxi Lu, Vincent Gbouna Zakka, Luis J. Manso, Jose Maria Alcaraz Calero, Chen Li

AI summary

GazeMoE achieves state-of-the-art gaze target estimation by using a Mixture-of-Experts decoder to dynamically route and integrate visual cues, improving accuracy and robustness across multiple benchmark datasets.

Gaze target estimation; Mixture-of-Experts; Vision foundation models; Human-robot interaction; Class imbalance; Data augmentation

Problem

Estimating human gaze targets from visible images is critical for robotics and human-computer interaction, but existing methods struggle with generalization, class imbalance between in-frame and out-of-frame targets, and adaptively integrating diverse, sometimes missing, visual cues.

Approach

GazeMoE combines a frozen DINOv2 vision encoder with a Mixture-of-Experts transformer decoder that selectively routes gaze-relevant features. It also introduces a class-balancing focal loss and comprehensive geometric and photometric augmentations to stabilize training and boost generalization.
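
To make the routing idea concrete, here is a minimal sketch of top-k Mixture-of-Experts gating over a single feature vector. It is not the paper's decoder: the class name `ToyMoELayer`, the expert count, the `top_k` value, and the use of plain linear maps as experts are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyMoELayer:
    """Hypothetical top-k MoE layer; each expert is a single linear map."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.gate = rand_matrix(num_experts, dim)  # one gating row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, token)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        # Renormalize over the selected experts only, so weights sum to 1.
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, token):
        """Weighted sum of the outputs of the selected experts."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * y for o, y in zip(out, expert_out)]
        return out
```

In this sketch only the selected experts are evaluated for a given token, which is the property that lets an MoE layer add capacity without running every expert on every input.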

Key results

State-of-the-art performance on GazeFollow, VideoAttentionTarget, ChildPlay, and GazeFollow360 benchmarks
Superior robustness to out-of-distribution data, including fisheye imagery and child gaze datasets
Effective resolution of in-frame vs. out-of-frame class imbalance via custom focal loss
High accuracy with only 3.4M learnable parameters, maintaining computational efficiency
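
As a rough illustration of how a focal loss can counter the in-frame vs. out-of-frame imbalance, the sketch below implements the standard binary focal loss with a class weight. The summary does not give the exact form of GazeMoE's custom loss, so the function name `balanced_focal_loss` and the `alpha` and `gamma` defaults (the commonly cited values from the original focal loss formulation) are assumptions, not the authors' settings.

```python
import math


def balanced_focal_loss(p_in, label, alpha=0.25, gamma=2.0, eps=1e-7):
    """Standard binary focal loss for one sample.

    p_in  : predicted probability that the gaze target is in-frame
    label : 1 for in-frame, 0 for out-of-frame
    alpha : weight on the in-frame class (assumed value, not from the paper)
    gamma : focusing parameter (assumed value, not from the paper)
    """
    p_in = min(max(p_in, eps), 1.0 - eps)  # avoid log(0)
    # p_t is the probability assigned to the true class.
    p_t = p_in if label == 1 else 1.0 - p_in
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `alpha` below 0.5, a confidently wrong prediction on an out-of-frame sample is penalized far more than a confident correct prediction on an in-frame one, which is the intended effect when in-frame samples dominate the training set.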

Why it matters

Enables reliable gaze-following for real-world robotics, autonomous systems, and human-computer interaction across varying visual conditions and demographics.

Abstract

Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues, including eyes, head poses, gestures, and contextual features, demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at: https://huggingface.co/zdai257/GazeMoE.
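
The two augmentation families named in the abstract can be sketched as follows. The abstract does not define "region-specific cropping", so this example takes one possible reading: a random crop that always keeps the subject's head box inside the image. The function names, the jitter ranges, and the grayscale nested-list image format are assumptions made to keep the example dependency-free.

```python
import random


def photometric_jitter(image, rng, max_brightness=0.2, max_contrast=0.2):
    """Randomly shift brightness and scale contrast; pixels stay in [0, 1]."""
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    mean = sum(sum(row) for row in image) / (len(image) * len(image[0]))
    return [
        [min(1.0, max(0.0, (p - mean) * contrast + mean + brightness)) for p in row]
        for row in image
    ]


def crop_keeping_head(image, head_box, rng):
    """Random crop that never cuts into the head box (assumed interpretation).

    head_box is (x0, y0, x1, y1) in pixels with exclusive upper bounds.
    Returns the cropped image and the head box in crop coordinates.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = head_box
    left = rng.randint(0, x0)         # crop may start anywhere left of the head
    top = rng.randint(0, y0)          # ... or anywhere above it
    right = rng.randint(x1, width)    # and must end at or beyond the head
    bottom = rng.randint(y1, height)
    cropped = [row[left:right] for row in image[top:bottom]]
    return cropped, (x0 - left, y0 - top, x1 - left, y1 - top)
```

In a real pipeline the gaze target annotation would need the same coordinate shift, and a crop could push an in-frame target out of frame, so the in-frame label may need to be recomputed after cropping.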

Index terms

Gesture, Posture and Facial Expressions; Human Factors and Human-in-the-Loop; Human-Robot Collaboration