Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

Authors

  • Haoyu Wang, Department of Computer Science and Technology, Tsinghua University
  • Xiaozhe Xin, Alibaba Group
  • Xiaoyu Qin, Department of Computer Science and Technology, Tsinghua University
  • Meiguang Jin, Alibaba Group
  • Junfeng Ma, Alibaba Group
  • Dan Xu, Hong Kong University of Science and Technology
  • Jia Jia, Department of Computer Science and Technology, Tsinghua University; Key Laboratory of Pervasive Computing, Ministry of Education; BNRist, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v40i21.38834

Abstract

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect (e.g., downward lip corners when sad), thus limiting expressiveness. We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end inference. Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.
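
As a rough illustration of the few-step sampling idea mentioned in the abstract, the sketch below integrates a flow-matching ODE with ten Euler steps and decodes the result through a dictionary of orthonormal bases. It is not the authors' implementation: the function names, the dimensions, the time convention (noise at t = 0, sample at t = 1), and the closed-form toy velocity field are all assumptions made for illustration. In the actual system the velocity would presumably come from a trained network conditioned on audio and emotion, and the dictionary would be learned by the autoencoder.

```python
import numpy as np


def make_orthonormal_dictionary(num_bases, dim, seed=0):
    """Hypothetical stand-in for a learned motion dictionary.

    Returns an array of shape (num_bases, dim) whose rows are orthonormal,
    obtained here from a QR decomposition of a random matrix.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_bases)))
    return q.T


def toy_velocity(x, t, target):
    """Toy velocity field for a straight-line path from noise to `target`.

    Assumption: replaces the trained, audio- and emotion-conditioned network.
    For the linear path x_t = (1 - t) * x_0 + t * x_1, the velocity that
    carries x_t to x_1 is (x_1 - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_euler(velocity, x0, num_steps=10, **cond):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        x = x + dt * velocity(x, t, **cond)
    return x


rng = np.random.default_rng(1)
basis = make_orthonormal_dictionary(num_bases=20, dim=64)

# Hypothetical target coefficients, standing in for what the conditioning
# signals would select in a trained model.
target_coeffs = rng.standard_normal(20)
noise = rng.standard_normal(20)

coeffs = sample_euler(toy_velocity, noise, num_steps=10, target=target_coeffs)

# Decode: a motion latent is a linear combination of the orthonormal bases.
motion = coeffs @ basis
```

With this toy field the final Euler step lands on the target coefficients, which is why so few steps suffice here; a learned velocity field would only approximate this behaviour, and the step count then trades sample quality against latency.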

Published

2026-03-14

How to Cite

Wang, H., Xin, X., Qin, X., Jin, M., Ma, J., Xu, D., & Jia, J. (2026). Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads. Proceedings of the AAAI Conference on Artificial Intelligence, 40(21), 17769–17777. https://doi.org/10.1609/aaai.v40i21.38834

Section

AAAI Technical Track on Humans and AI