SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Yifan Liang; Andong Li; Kang Yang; Guochen Yu; Fangkun Liu; Lingling Dai; Xiaodong Li; Chengshi Zheng

doi:10.1609/aaai.v40i38.40464

Authors

Yifan Liang: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Andong Li: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Kang Yang: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Guochen Yu: Zhipu
Fangkun Liu: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Lingling Dai: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Xiaodong Li: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Chengshi Zheng: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Abstract

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
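The reparameterization mentioned in the abstract can be illustrated with a generic flow matching sketch. The abstract does not give the authors' formulation, so everything here is an assumption: a linear path between Gaussian noise `x0` and a target latent `x1`, a model that predicts the clean latent `x1_hat` instead of the velocity, and the hypothetical helper names `interpolate`, `velocity_from_x1`, and `x1_prediction_loss`. Predicting `x1_hat` directly is what would make it possible to attach auxiliary losses to the generated latent, while the velocity needed for sampling can still be recovered from it.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Linear probability path (assumed): x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1


def velocity_from_x1(x1_hat, x_t, t, eps=1e-5):
    """Velocity implied by a predicted clean latent on the linear path.

    Since x_t = (1 - t) * x0 + t * x1, the path velocity x1 - x0 equals
    (x1 - x_t) / (1 - t). The denominator is clamped to avoid division
    by zero as t approaches 1.
    """
    return (x1_hat - x_t) / np.maximum(1.0 - t, eps)


def x1_prediction_loss(x1_hat, x1):
    """Mean squared error on the predicted latent (illustrative objective)."""
    return float(np.mean((x1_hat - x1) ** 2))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))  # stand-in for target codec latents
x0 = rng.standard_normal((4, 8))  # Gaussian noise sample
t = 0.3

x_t = interpolate(x0, x1, t)

# With a perfect prediction (x1_hat == x1), the implied velocity matches
# the conventional flow matching target x1 - x0.
v_implied = velocity_from_x1(x1, x_t, t)
perfect_loss = x1_prediction_loss(x1, x1)
```

In this sketch, any additional loss term (for example one computed on decoded audio) would take `x1_hat` as input alongside `x1_prediction_loss`; how SLD-L2S actually weights or combines such terms is not stated in the abstract.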
