SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Authors

  • Yifan Liang Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
  • Andong Li Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
  • Kang Yang School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
  • Guochen Yu Zhipu
  • Fangkun Liu Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
  • Lingling Dai Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
  • Xiaodong Li Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
  • Chengshi Zheng Institute of Acoustics, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i38.40464

Abstract

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.

Downloads

Published

2026-03-14

How to Cite

Liang, Y., Li, A., Yang, K., Yu, G., Liu, F., Dai, L., … Zheng, C. (2026). SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 31943–31951. https://doi.org/10.1609/aaai.v40i38.40464

Issue

Section

AAAI Technical Track on Natural Language Processing III