ICRA 2026

3DFacePolicy: Speech-Driven 3D Facial Animation Based on Diffusion Policy

Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi

AI summary

Reformulating speech-driven 3D facial animation as a vertex trajectory control problem using a robotic diffusion policy yields significantly smoother, more continuous, and expressive facial animations than existing methods.

Speech-driven animation; 3D facial animation; diffusion policy; vertex trajectory control; digital humans; robotic imitation learning

Problem

Existing speech-driven 3D facial animation methods struggle with discontinuous, vague, or unnatural movements due to deterministic frame-by-frame generation or high-noise diffusion approaches that overlook smooth vertex trajectory modeling.

Approach

3DFacePolicy defines facial motion as discrete 'actions' representing frame-to-frame vertex displacements and uses a robotic-inspired diffusion policy to predict these actions conditioned on audio and vertex states, accumulating them to control smooth trajectories.
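The action definition and the accumulation step can be illustrated with a minimal sketch. It assumes a mesh sequence stored as a NumPy array of shape (frames, vertices, 3). The function names `vertices_to_actions` and `accumulate_actions` are hypothetical and do not come from the paper, and the diffusion policy that would predict the actions from audio and vertex states is not modeled here; ground-truth displacements stand in for its output.

```python
import numpy as np


def vertices_to_actions(vertices):
    """Convert a vertex sequence into per-frame actions.

    vertices: array of shape (T, V, 3).
    Returns an array of shape (T - 1, V, 3) where
    actions[t] = vertices[t + 1] - vertices[t].
    """
    return np.diff(vertices, axis=0)


def accumulate_actions(initial_state, actions):
    """Rebuild a vertex trajectory from a start frame and a list of actions.

    initial_state: array of shape (V, 3), the first frame.
    actions: array of shape (T - 1, V, 3), frame-to-frame displacements.
    Returns an array of shape (T, V, 3).
    """
    # Running sum of displacements gives each frame's offset from the start.
    offsets = np.cumsum(actions, axis=0)
    later_frames = initial_state[None] + offsets
    return np.concatenate([initial_state[None], later_frames], axis=0)


# Toy data: 5 frames of a 4-vertex mesh (sizes chosen only for illustration).
rng = np.random.default_rng(0)
vertices = rng.normal(size=(5, 4, 3))

# In the described method a policy would predict these actions;
# here they are computed from the toy sequence itself.
actions = vertices_to_actions(vertices)
rebuilt = accumulate_actions(vertices[0], actions)
```

Because each frame is the previous frame plus one displacement, summing the actions from a known starting frame recovers the full trajectory, which is the sense in which the actions "control" vertex motion.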

Key results

Achieves state-of-the-art performance on VOCASET and BIWI datasets across MVE, FDD, and UFVE metrics.
Demonstrates superior lip-sync accuracy, realism, and emotional expression in user studies.
Introduces a novel action-based control framework that redefines facial animation synthesis as vertex trajectory control.
Validates that smoother vertex motion trajectories directly correlate with more realistic and natural facial animations.

Why it matters

Provides a new cross-domain paradigm for digital human generation, benefiting virtual avatars, AI assistants, and robotics by enabling highly natural and continuous speech-driven facial expressions.

Abstract

Speech-driven 3D facial animation has achieved significant progress in both research and applications. However, recent baselines struggle to generate natural and continuous facial movements because they generate vertices frame by frame. We propose 3DFacePolicy, a pioneering work that introduces a novel definition of vertex trajectory changes across consecutive frames through the concept of “action”. By predicting action sequences for each vertex that encode frame-to-frame movements, we reformulate the vertex generation approach as an action-based control paradigm. Specifically, we leverage a robotic control mechanism, diffusion policy, to predict action sequences conditioned on both audio and vertex states. Extensive experiments on the VOCASET and BIWI datasets demonstrate that our approach significantly outperforms state-of-the-art methods and particularly excels at producing dynamic, expressive, and naturally smooth facial animations.

Index terms

Computer Vision for Automation; Motion and Path Planning; Imitation Learning