ICRA 2026

Social-Qwen: From Individual Nonverbal Cues and Emotion to Multiparty Social Dynamics Understanding with Instruction Tuning

Tung Nguyen, Jouh Yeong Chew

AI summary

Explicitly modeling individual nonverbal cues and emotions before summarizing them significantly improves group engagement prediction accuracy and generalization over end-to-end baselines.

Group engagement, Vision-Language Models, Knowledge Distillation, Social Dynamics, Instruction Tuning, Robotics

Problem

Existing approaches for multiparty social dynamics rely on computationally heavy large models or lack individual-level annotations, hindering accurate and real-time group engagement analysis.

Approach

Social-Qwen is a two-stage instruction-tuned Vision-Language Model. It first analyzes each participant's nonverbal cues and emotions, using knowledge distillation to supply the individual-level supervision that group datasets lack, and then summarizes these individual states to predict group-level engagement.
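The two-stage flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `IndividualState`, `analyze_participants`, `build_group_prompt`, and `predict_group_engagement`, the prompt wording, and the thread-based parallelism are all assumptions made for illustration. The model calls are passed in as plain callables (`describe` for stage 1, `summarize` for stage 2), so any VLM backend could be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IndividualState:
    """Stage-1 output for one participant (hypothetical schema)."""
    participant_id: int
    description: str  # free-text summary of nonverbal cues and emotion


def analyze_participants(
    crops: Sequence[object],
    describe: Callable[[object], str],
    max_workers: int = 4,
) -> List[IndividualState]:
    """Stage 1: describe each participant independently.

    Participants do not depend on each other at this stage, so the calls
    can run in parallel. `describe` stands in for a per-person model call.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        descriptions = list(pool.map(describe, crops))
    return [IndividualState(i, d) for i, d in enumerate(descriptions)]


def build_group_prompt(states: Sequence[IndividualState]) -> str:
    """Combine per-participant descriptions into one stage-2 prompt.

    The prompt wording is an assumption, not taken from the paper.
    """
    lines = [f"Participant {s.participant_id}: {s.description}" for s in states]
    return (
        "Individual observations:\n"
        + "\n".join(lines)
        + "\nQuestion: What is the group's overall engagement level?"
    )


def predict_group_engagement(
    crops: Sequence[object],
    describe: Callable[[object], str],
    summarize: Callable[[str], str],
    max_workers: int = 4,
) -> str:
    """Stage 2: summarize the individual states into a group-level answer."""
    states = analyze_participants(crops, describe, max_workers)
    return summarize(build_group_prompt(states))


if __name__ == "__main__":
    # Stub callables stand in for real model calls.
    stub_describe = lambda crop: f"leaning forward, smiling ({crop})"
    stub_summarize = lambda prompt: "high" if "smiling" in prompt else "low"
    print(predict_group_engagement(["p0.png", "p1.png"], stub_describe, stub_summarize))
```

Keeping stage 1 free of cross-participant dependencies is what makes it straightforward to parallelize; only the short stage-2 summarization step has to see all participants at once.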

Key results

Achieves state-of-the-art accuracy on the OUC-CGE group engagement dataset
Demonstrates strong zero-shot generalization across multiple social activities not seen during training
Successfully infers additional social dynamics like group harmony without explicit training
Enables faster, parallelizable inference suitable for real-time robotic deployment

Why it matters

Provides a scalable, computationally efficient framework for robots and intelligent agents to understand and respond to complex multiparty social interactions in real time.

Abstract

Effective participation in multiparty scenarios requires robots to move beyond individual-level understanding toward group-level social dynamics, which are inherently complex due to the interplay of nonverbal cues, internal states, and interaction context. Existing approaches often rely on end-to-end deterministic models, while recent state-of-the-art methods such as large Vision-Language Models (VLMs) address this issue to some extent but remain limited by their size and computational cost for real-time applications. Moreover, both approaches are constrained by the scarcity of multiparty interaction data and of annotations describing how individual nonverbal cues and emotional states contribute to social dynamics, that is, to collective outcomes such as group engagement. We hypothesize that explicitly modeling individual-level states is essential for accurate group-level understanding. To this end, we present Social-Qwen, a two-stage framework that first analyzes each participant’s nonverbal cues and emotions, then infers group-level engagement using instruction-tuned representations. To mitigate the lack of individual annotations in group datasets, we employ knowledge distillation to transfer supervision signals. Experiments on the OUC-CGE dataset show that Social-Qwen significantly outperforms prior end-to-end baselines and achieves state-of-the-art performance in group engagement analysis, demonstrating the promise of instruction tuning for scalable social intelligence in robots. We further evaluate robustness by testing generalization (1) to an in-house dataset spanning multiple social activities and (2) to the estimation of other social dynamics such as group harmony. Results suggest consistent performance, highlighting Social-Qwen as a promising approach toward real-time social intelligence for intelligent agents.

Index terms

Computer Vision for Automation, Big Data in Robotics and Automation, Transfer Learning