Cauchy Diffusion: A Heavy-tailed Denoising Diffusion Probabilistic Model for Speech Synthesis

Authors

  • Qi Lian The College of Computer Science and Technology, Zhejiang University
  • Yu Qi MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University; Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, Zhejiang University; The College of Computer Science and Technology, Zhejiang University; State Key Lab of Brain-Machine Intelligence, Zhejiang University
  • Yueming Wang The College of Computer Science and Technology, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v39i23.34634

Abstract

Denoising diffusion probabilistic models (DDPMs) have gained popularity for building neural vocoders and have achieved outstanding performance. However, existing DDPM-based neural vocoders struggle to capture prosodic diversity because they are susceptible to mode collapse when confronted with imbalanced data. We introduce Cauchy Diffusion, a model that incorporates Cauchy noise to address this challenge. The heavy-tailed Cauchy distribution exhibits better resilience to imbalanced speech data, potentially improving prosody modeling. Our experiments on the LJSpeech and VCTK datasets demonstrate that Cauchy Diffusion achieves state-of-the-art speech synthesis performance. Compared to existing neural vocoders, Cauchy Diffusion notably improves speech diversity while maintaining superior speech quality. Remarkably, Cauchy Diffusion surpasses neural vocoders based on generative adversarial networks (GANs) that are explicitly optimized to improve diversity.

Published

2025-04-11

How to Cite

Lian, Q., Qi, Y., & Wang, Y. (2025). Cauchy Diffusion: A Heavy-tailed Denoising Diffusion Probabilistic Model for Speech Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24549–24557. https://doi.org/10.1609/aaai.v39i23.34634

Section

AAAI Technical Track on Natural Language Processing II