ICRA 2026

Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

Huanzhen Wang, Ziheng Zhou, Zeng Tao, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Yan Wang, Wenqiang Zhang

AI summary

Emulating the brain's dual-stream cognitive mechanisms via a vision-language framework significantly boosts dynamic facial expression recognition accuracy and interpretability.

Dynamic Emotion Recognition, Cognitive Computing, Vision-Language Models, Dual-Stream Architecture, Facial Expression Analysis, Cross-Modal Alignment

Problem

Existing vision-based dynamic emotion recognition models rely solely on visual cues, neglecting human cognitive processes like semantic priming and knowledge integration, which limits robustness and interpretability in real-world conditions.

Approach

DuSE integrates a Hierarchical Temporal Prompt Cluster to simulate cognitive priming through cross-modal text-visual alignment, and a Latent Semantic Emotion Aggregator to model knowledge integration, leveraging pre-trained vision-language priors.
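As a rough illustration of the cross-modal text-visual alignment idea mentioned above, the following sketch scores a short clip against per-class text prompt embeddings using cosine similarity averaged over frames. It is not the authors' HTPC implementation: the function names (`classify_clip`, `cosine`), the mean-over-frames pooling, the temperature value, and the toy three-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from a pre-trained vision-language model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_clip(frame_feats, prompt_embs, temperature=0.07):
    """Score one clip against text prompts (hypothetical helper).

    frame_feats: list of T per-frame visual feature vectors.
    prompt_embs: dict mapping an emotion label to its text embedding.
    Returns a dict mapping each label to a probability.
    """
    labels = list(prompt_embs)
    logits = []
    for label in labels:
        # Average the frame-to-prompt similarity over time (an assumed pooling choice).
        sims = [cosine(f, prompt_embs[label]) for f in frame_feats]
        logits.append(sum(sims) / len(sims) / temperature)
    return dict(zip(labels, softmax(logits)))


# Toy data: two frames whose features point mostly along the "happy" prompt.
frames = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
prompts = {"happy": [1.0, 0.0, 0.0], "sad": [0.0, 1.0, 0.0]}
probs = classify_clip(frames, prompts)
```

Mean pooling over frames is the simplest possible temporal treatment; a hierarchical scheme such as the one the summary describes would presumably align prompts with features at several temporal granularities rather than a single average.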

Key results

State-of-the-art accuracy on DFEW and FERV39k benchmarks
Effective cross-modal alignment of textual prompts with facial dynamics
Enhanced interpretability through human-aligned semantic guidance
Robust performance on challenging in-the-wild video sequences

Why it matters

Offers a neurologically plausible, interpretable framework for emotion AI, advancing healthcare, robotics, and human-computer interaction applications.

Abstract

The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain’s strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.
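One common way to combine sensory inputs with learned conceptual knowledge is to let a small set of learned latent vectors attend over the per-frame features. The sketch below shows that generic pattern with single-head scaled dot-product attention. It is an assumption about how such an aggregator could look, not the LSEA design from the paper: `aggregate`, the latent vectors, and the toy two-dimensional features are hypothetical, and the latents that would normally be trained are hard-coded here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(latents, frame_feats):
    """Pool frame features with latent queries (hypothetical helper).

    latents: list of K latent query vectors (learned in a real model).
    frame_feats: list of T per-frame feature vectors of the same dimension.
    Returns K pooled vectors, each a convex combination of the frames.
    """
    dim = len(frame_feats[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for query in latents:
        # Attention weights of this latent over all frames.
        weights = softmax([dot(query, f) * scale for f in frame_feats])
        # Weighted sum of frame features under those weights.
        pooled.append(
            [sum(w * f[i] for w, f in zip(weights, frame_feats)) for i in range(dim)]
        )
    return pooled


# Toy data: each latent is tuned to one of two orthogonal frame patterns.
frames = [[1.0, 0.0], [0.0, 1.0]]
latents = [[10.0, 0.0], [0.0, 10.0]]
summary = aggregate(latents, frames)
```

Each pooled vector summarizes the clip from the viewpoint of one latent, so the first latent concentrates on the first frame and the second on the second. A downstream classifier could then operate on these K summaries instead of all T frames.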

Index terms

Embodied Cognitive Science; Gesture, Posture and Facial Expressions