Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

Haoyu Wang; Xiaozhe Xin; Xiaoyu Qin; Meiguang Jin; Junfeng Ma; Dan Xu; Jia Jia

doi:10.1609/aaai.v40i21.38834

Authors

Haoyu Wang Department of Computer Science and Technology, Tsinghua University
Xiaozhe Xin Alibaba Group
Xiaoyu Qin Department of Computer Science and Technology, Tsinghua University
Meiguang Jin Alibaba Group
Junfeng Ma Alibaba Group
Dan Xu Hong Kong University of Science and Technology
Jia Jia Department of Computer Science and Technology, Tsinghua University; Key Laboratory of Pervasive Computing, Ministry of Education; BNRist, Tsinghua University

DOI: https://doi.org/10.1609/aaai.v40i21.38834

Abstract

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect (e.g., downward lip corners when sad), thus limiting expressiveness. We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency. Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.
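
To make the ten-step sampling idea concrete, the following is a minimal sketch of fixed-step Euler integration of a flow-matching ODE. It is not the authors' implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the closed-form toy velocity field stands in for the learned network that the abstract describes as conditioned on audio and emotion. The toy field assumes a straight-line probability path from noise to a single known target latent, which is one common flow-matching choice and is not confirmed by the abstract.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # current time; never reaches 1.0 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path x_t = (1 - t) * x0 + t * target, the velocity
    that carries any point x at time t to the target is (target - x) / (1 - t).
    A real system would replace this with a network that also receives
    audio and emotion features (an assumption, not shown here).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy usage: start from a fixed "noise" vector and move to a target latent.
target = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.7]
latent = euler_sample(make_toy_velocity(target), noise, num_steps=10)
```

With this toy field the Euler iterates lie on the straight line between the start and the target, so ten steps land on the target up to floating-point error. A learned velocity field is only approximately straight, so the step count becomes a trade-off between speed and sample quality.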
