VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization

Authors

  • Tao Liu, Shanghai Jiao Tong University
  • Ziyang Ma, Shanghai Jiao Tong University
  • Qi Chen, Shanghai Jiao Tong University
  • Feilong Chen, AISpeech Ltd
  • Shuai Fan, AISpeech Ltd
  • Xie Chen, Shanghai Jiao Tong University
  • Kai Yu, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i6.32595

Abstract

We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512 × 512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation.
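The Group Residual Finite Scalar Quantization (GRFSQ) idea in the abstract can be sketched in miniature: split a feature vector into groups, and within each group apply several residual stages of finite scalar quantization (bounding each dimension, then rounding it to a small integer grid). This is a minimal illustrative sketch, not the paper's implementation; the group count, residual depth, and per-dimension levels below are assumed values chosen for the example.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization of one vector.

    Each dimension is bounded with tanh to [-(L-1)/2, (L-1)/2]
    and then rounded to the nearest integer grid point.
    """
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half)

def grfsq_encode(z, num_groups=2, num_residual=2, levels=np.array([5.0, 5.0])):
    """Toy GRFSQ encoder: group split + residual FSQ stages.

    Returns, per group, a list of integer code vectors (one per
    residual stage). Settings here are illustrative assumptions.
    """
    codes = []
    for group in np.split(np.asarray(z, dtype=float), num_groups):
        residual = group.copy()
        stage_codes = []
        for _ in range(num_residual):
            q = fsq_quantize(residual, levels)   # quantize current residual
            stage_codes.append(q)
            residual = residual - q              # pass leftover to next stage
        codes.append(stage_codes)
    return codes

codes = grfsq_encode([0.3, -1.2, 2.0, 0.7])
```

With per-dimension level count L, each code dimension costs log2(L) bits, so the total bitrate scales with groups × residual stages × dimensions × log2(L) per frame; discretizing motion this way is what lets the paper report compact bitstreams (around 11 kbps at 512 × 512) rather than transmitting dense features.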

Published

2025-04-11

How to Cite

Liu, T., Ma, Z., Chen, Q., Chen, F., Fan, S., Chen, X., & Yu, K. (2025). VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5586–5594. https://doi.org/10.1609/aaai.v39i6.32595

Section

AAAI Technical Track on Computer Vision V