VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization

Authors

  • Tao Liu, Shanghai Jiao Tong University
  • Ziyang Ma, Shanghai Jiao Tong University
  • Qi Chen, Shanghai Jiao Tong University
  • Feilong Chen, AISpeech Ltd
  • Shuai Fan, AISpeech Ltd
  • Xie Chen, Shanghai Jiao Tong University
  • Kai Yu, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i6.32595

Abstract

We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512 × 512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation.
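The Group Residual Finite Scalar Quantization (GRFSQ) idea in the abstract can be sketched in miniature: split a feature vector into groups, and within each group apply several residual stages of finite scalar quantization (bounding each dimension, then rounding it to a small integer grid). This is a minimal illustrative sketch, not the paper's implementation; the group count, residual depth, and per-dimension levels below are assumed values chosen for the example.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization of one vector.

    Each dimension is bounded with tanh to [-(L-1)/2, (L-1)/2]
    and then rounded to the nearest integer grid point.
    """
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half)

def grfsq_encode(z, num_groups=2, num_residual=2, levels=np.array([5.0, 5.0])):
    """Toy GRFSQ encoder: group split + residual FSQ stages.

    Returns, per group, a list of integer code vectors (one per
    residual stage). Settings here are illustrative assumptions.
    """
    codes = []
    for group in np.split(np.asarray(z, dtype=float), num_groups):
        residual = group.copy()
        stage_codes = []
        for _ in range(num_residual):
            q = fsq_quantize(residual, levels)   # quantize current residual
            stage_codes.append(q)
            residual = residual - q              # pass leftover to next stage
        codes.append(stage_codes)
    return codes

codes = grfsq_encode([0.3, -1.2, 2.0, 0.7])
```

With per-dimension level count L, each code dimension costs log2(L) bits, so the total bitrate scales with groups × residual stages × dimensions × log2(L) per frame; discretizing motion this way is what lets the paper report compact bitstreams (around 11 kbps at 512 × 512) rather than transmitting dense features.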

Published

2025-04-11

How to Cite

Liu, T., Ma, Z., Chen, Q., Chen, F., Fan, S., Chen, X., & Yu, K. (2025). VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5586–5594. https://doi.org/10.1609/aaai.v39i6.32595

Section

AAAI Technical Track on Computer Vision V