Social-Qwen: From Individual Nonverbal Cues and Emotion to Multiparty Social Dynamics Understanding with Instruction Tuning
Tung Nguyen, Jouh Yeong Chew
AI summary
Problem
Existing approaches for multiparty social dynamics rely on computationally heavy large models or lack individual-level annotations, hindering accurate and real-time group engagement analysis.
Approach
Social-Qwen employs a two-stage instruction-tuned Vision-Language Model that first analyzes each participant's nonverbal cues and emotions via knowledge distillation, then summarizes these states to predict group-level engagement.
Key results
- Achieves state-of-the-art accuracy on the OUC-CGE group engagement dataset
- Demonstrates strong zero-shot generalization across multiple untrained social activities
- Successfully infers additional social dynamics like group harmony without explicit training
- Enables faster, parallelizable inference suitable for real-time robotic deployment
Why it matters
Provides a scalable, computationally efficient framework for robots and intelligent agents to understand and respond to complex multiparty social interactions in real time.
Abstract
Effective participation in multiparty scenarios requires robots to move beyond individual-level analysis toward understanding group-level social dynamics, which are inherently complex due to the interplay of nonverbal cues, internal states, and interaction context. Existing approaches often rely on end-to-end deterministic models, while recent state-of-the-art methods such as large Vision-Language Models (VLMs) address this complexity to some extent but remain limited by their size and computational cost for real-time applications. Moreover, both approaches are constrained by the scarcity of multiparty interaction data and of annotations that describe how individual nonverbal cues and emotional states contribute to social dynamics, that is, to collective outcomes such as group engagement. We hypothesize that explicitly modeling individual-level states is essential for accurate group-level understanding. To this end, we present Social-Qwen, a two-stage framework that first analyzes each participant’s nonverbal cues and emotions, then infers group-level engagement using instruction-tuned representations. To mitigate the lack of individual annotations in group datasets, we employ knowledge distillation to transfer supervision signals. Experiments on the OUC-CGE dataset show that Social-Qwen significantly outperforms prior end-to-end baselines and achieves state-of-the-art performance in group engagement analysis, demonstrating the promise of instruction tuning for scalable social intelligence in robots. We further evaluate robustness by testing generalization to (1) an in-house dataset spanning multiple social activities and (2) the estimation of other social dynamics such as group harmony. Results suggest consistent performance, highlighting Social-Qwen as a promising approach toward real-time social intelligence for intelligent agents.
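The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the names `IndividualState`, `analyze_individuals`, `build_group_prompt`, and `infer_group_engagement` are hypothetical, the prompt wording is invented, and the two stub functions stand in for the instruction-tuned VLM calls, whose actual interface is not given here. Stage 1 is run through a thread pool on the assumption that per-participant analysis is independent and can therefore be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IndividualState:
    """Stage-1 output for one participant (hypothetical schema)."""
    participant_id: int
    description: str  # free-text summary of nonverbal cues and emotion


def analyze_individuals(
    crops: Sequence[object],
    describe: Callable[[object], str],
) -> List[IndividualState]:
    """Stage 1: describe each participant independently, in parallel."""
    with ThreadPoolExecutor() as pool:
        # Executor.map returns results in the same order as the inputs.
        texts = list(pool.map(describe, crops))
    return [IndividualState(i, text) for i, text in enumerate(texts)]


def build_group_prompt(states: Sequence[IndividualState]) -> str:
    """Summarize the individual states into one text prompt (assumed wording)."""
    lines = [f"Participant {s.participant_id}: {s.description}" for s in states]
    return "\n".join(lines) + "\nQuestion: What is the group's engagement level?"


def infer_group_engagement(
    states: Sequence[IndividualState],
    classify: Callable[[str], str],
) -> str:
    """Stage 2: predict a group-level label from the summarized states."""
    return classify(build_group_prompt(states))


# Stand-ins for the model calls, so the sketch runs without any model.
def stub_describe(crop: object) -> str:
    return f"placeholder cues for {crop}"


def stub_classify(prompt: str) -> str:
    return "high" if "smiling" in prompt else "low"


if __name__ == "__main__":
    states = analyze_individuals(["person_a", "person_b"], stub_describe)
    print(build_group_prompt(states))
    print(infer_group_engagement(states, stub_classify))
```

Passing the model calls in as plain callables keeps the two stages decoupled, so the stubs could be replaced by real per-participant and group-level inference functions without changing the pipeline structure.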