READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
DOI:
https://doi.org/10.1609/aaai.v40i12.37940
Abstract
The introduction of diffusion models has brought significant advances to audio-driven talking head generation. However, their extremely slow inference severely limits the practical deployment of diffusion-based talking head generation models. In this study, we propose READ, a real-time talking head generation framework based on a diffusion transformer. Our approach first learns a highly compressed spatiotemporal video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, we propose a pre-trained Speech Autoencoder (SpeechAE) that generates temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerate inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency across generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods, generating competitive talking head videos with significantly reduced runtime and achieving an optimal balance between quality and speed while maintaining robust metric stability in long-duration generation.
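To illustrate why a more aggressively compressed latent space reduces the token count, the following minimal sketch counts latent tokens for a video clip under two sets of VAE strides. The clip size, the stride values, and the helper name `token_count` are all hypothetical choices made for illustration. They are not taken from the paper and do not describe READ's actual configuration.

```python
def token_count(frames, height, width, t_stride, s_stride):
    """Count latent tokens for a clip, assuming one token per latent position.

    t_stride and s_stride are hypothetical temporal and spatial
    downsampling factors of a video VAE (illustrative values only).
    """
    latent_frames = -(-frames // t_stride)  # ceiling division over time
    latent_h = height // s_stride
    latent_w = width // s_stride
    return latent_frames * latent_h * latent_w


# Hypothetical clip: 48 frames at 512x512 pixels.
frames, height, width = 48, 512, 512

# Spatial-only compression (8x per side, no temporal compression).
baseline = token_count(frames, height, width, t_stride=1, s_stride=8)

# Spatiotemporal compression (4x in time, 32x per side).
compressed = token_count(frames, height, width, t_stride=4, s_stride=32)

ratio = baseline // compressed
print(baseline, compressed, ratio)  # 196608 3072 64
```

Under these assumed strides the sequence is 64 times shorter. Because full self-attention compares every token with every other token, the number of attention pairs shrinks by roughly the square of that factor, which is why token reduction translates into large speedups for transformer backbones.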
Published
2026-03-14
How to Cite
Wang, H., Weng, Y., Du, J., Xu, H., Wu, X., He, S., … Liu, Q. (2026). READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9766–9774. https://doi.org/10.1609/aaai.v40i12.37940
Issue
Section
AAAI Technical Track on Computer Vision IX