Hierarchically Controlled Deformable 3D Gaussians for Talking Head Synthesis

Authors

  • Zhenhua Wu, Sun Yat-sen University; Shanghai Innovation Institute
  • Linxuan Jiang, Guangdong University of Technology
  • Xiang Li, Gezhi Intelligent Technology
  • Chaowei Fang, Xidian University
  • Yipeng Qin, Cardiff University
  • Guanbin Li, Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing

DOI:

https://doi.org/10.1609/aaai.v39i8.32921

Abstract

Audio-driven talking head synthesis is a critical task in digital human modeling. While recent advances using diffusion models and Neural Radiance Fields (NeRF) have improved visual quality, they often require substantial computational resources, limiting practical deployment. We present a novel framework for audio-driven talking head synthesis, namely Hierarchically Controlled Deformable 3D Gaussians (HiCoDe), which achieves state-of-the-art performance with significantly reduced computational costs. Our key contribution is a hierarchical control strategy that effectively bridges the gap between sparse audio features and dense 3D Gaussian point clouds. Specifically, this strategy comprises two control levels: i) coarse-level control based on a 3D Morphable Model (3DMM) and ii) fine-level control using facial landmarks. Extensive experiments on the HDTF dataset and additional test sets demonstrate that our method outperforms existing approaches in visual quality, facial landmark accuracy, and audio-visual synchronization while being more computationally efficient in both training and inference.
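The two-level control described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the linear 3DMM expression basis, and the RBF-weighted landmark propagation are all assumptions chosen to make the coarse-to-fine idea concrete, with coarse control applying a global deformation from a handful of 3DMM-style parameters and fine control propagating per-landmark displacements to nearby Gaussian points.

```python
import numpy as np

def coarse_deform(points, expr_basis, expr_params):
    """Coarse-level control (illustrative): offset every Gaussian point by a
    3DMM-style linear expression basis driven by a few parameters.
    expr_basis: (K, N, 3) hypothetical per-point blendshapes; expr_params: (K,)."""
    return points + np.tensordot(expr_params, expr_basis, axes=1)

def fine_deform(points, landmarks, landmark_offsets, sigma=0.5):
    """Fine-level control (illustrative): spread landmark displacements to
    nearby Gaussian points with normalized Gaussian RBF weights."""
    d2 = ((points[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)  # (N, L)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                              # (N, L)
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)                     # rows sum to 1
    return points + w @ landmark_offsets                              # (N, 3)

# Toy usage with random data standing in for a real head model.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))                       # dense Gaussian centers
basis = rng.normal(scale=0.01, size=(5, 100, 3))      # 5 expression blendshapes
expr = rng.normal(size=5)                             # audio-predicted coarse params
lmks = rng.normal(size=(10, 3))                       # sparse facial landmarks
offs = rng.normal(scale=0.05, size=(10, 3))           # audio-predicted fine offsets

coarse = coarse_deform(pts, basis, expr)              # global, low-frequency motion
fine = fine_deform(coarse, lmks, offs)                # local refinement near landmarks
```

In the actual method, the coarse parameters and fine landmark offsets would be predicted from audio features by learned modules; the sketch only shows how sparse controls can drive a dense point set at two granularities.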

Published

2025-04-11

How to Cite

Wu, Z., Jiang, L., Li, X., Fang, C., Qin, Y., & Li, G. (2025). Hierarchically Controlled Deformable 3D Gaussians for Talking Head Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8532-8540. https://doi.org/10.1609/aaai.v39i8.32921

Section

AAAI Technical Track on Computer Vision VII