Cauchy Diffusion: A Heavy-tailed Denoising Diffusion Probabilistic Model for Speech Synthesis

Qi Lian; Yu Qi; Yueming Wang

doi:10.1609/aaai.v39i23.34634

Authors

Qi Lian: The College of Computer Science and Technology, Zhejiang University
Yu Qi: MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University; Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, Zhejiang University; The College of Computer Science and Technology, Zhejiang University; State Key Lab of Brain-Machine Intelligence, Zhejiang University
Yueming Wang: The College of Computer Science and Technology, Zhejiang University

Abstract

Denoising diffusion probabilistic models (DDPMs) have become popular for building neural vocoders and have achieved outstanding performance. However, existing DDPM-based neural vocoders struggle to capture prosodic diversity because they are susceptible to mode collapse when trained on imbalanced data. We introduce Cauchy Diffusion, a model that incorporates Cauchy noise to address this challenge. The heavy-tailed Cauchy distribution is more resilient to imbalanced speech data, potentially improving prosody modeling. Our experiments on the LJSpeech and VCTK datasets demonstrate that Cauchy Diffusion achieves state-of-the-art speech synthesis performance. Compared to existing neural vocoders, Cauchy Diffusion notably improves speech diversity while maintaining superior speech quality. Remarkably, Cauchy Diffusion surpasses neural vocoders based on generative adversarial networks (GANs) that are explicitly optimized to improve diversity.
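
As a rough illustration of what "heavy-tailed" means here, the sketch below compares how often standard Gaussian and standard Cauchy samples land far from zero. It is not the authors' model or training code; the sample size, the threshold of 5, and the helper name `tail_fraction` are assumptions chosen only for this example.

```python
import numpy as np


def tail_fraction(samples, threshold):
    """Return the fraction of samples whose magnitude exceeds `threshold`."""
    return float(np.mean(np.abs(samples) > threshold))


# Fixed seed so the comparison is reproducible (assumed value).
rng = np.random.default_rng(0)
n = 100_000  # assumed sample size for the illustration

gaussian = rng.standard_normal(n)  # light-tailed noise, as in a standard DDPM
cauchy = rng.standard_cauchy(n)    # heavy-tailed noise

gaussian_tail = tail_fraction(gaussian, 5.0)
cauchy_tail = tail_fraction(cauchy, 5.0)

print(f"P(|x| > 5), Gaussian: {gaussian_tail:.5f}")
print(f"P(|x| > 5), Cauchy:   {cauchy_tail:.5f}")
```

For the standard Cauchy distribution the exact tail mass beyond 5 is 1 - (2/π)·arctan(5), roughly 0.126, whereas for the standard Gaussian it is below one in a million. Noise with this much mass far from the center is what lets a model keep visiting rare regions of the data.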
