EA-VAE: Learning to Reconstruct Dysarthric Speech via Variational Autoencoder with Encoding Alignment

Daipeng Zhang; Wenhuan Lu; Xianghu Yue; Hongcheng Zhang; Jianguo Wei

doi:10.1609/aaai.v40i41.40766

Authors

Daipeng Zhang, School of New Media and Communication, Tianjin University, Tianjin, China
Wenhuan Lu, School of New Media and Communication, Tianjin University, Tianjin, China; College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Intelligence Science and Engineering, Qinghai Minzu University, Xining, China
Xianghu Yue, College of Intelligence and Computing, Tianjin University, Tianjin, China
Hongcheng Zhang, College of Intelligence and Computing, Tianjin University, Tianjin, China
Jianguo Wei, School of New Media and Communication, Tianjin University, Tianjin, China; College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Intelligence Science and Engineering, Qinghai Minzu University, Xining, China

Abstract

Dysarthric speech reconstruction (DSR) aims to improve the intelligibility of dysarthric speech. Compared with normal speech, dysarthric speech exhibits pathological characteristics such as discontinuous pronunciation, slow speech, hoarseness, and improper pauses. The resulting disparities between the feature spaces of normal and dysarthric speech can lead to suboptimal reconstruction and thus degrade intelligibility. To improve reconstruction across these feature spaces, this paper proposes a DSR model named the Encoding-Aligned Variational Autoencoder (EA-VAE). By incorporating alignment modules for frame-level embedding features, prior distributions, and duration into the VAE encoder, the model explicitly aligns the encoding of dysarthric speech with a representation of the parallel normal speech. A shared decoder then generates speech with improved intelligibility. Experimental results on the UASpeech benchmark show that EA-VAE achieves state-of-the-art performance, with a 31.7% relative word error rate reduction and the highest subjective mean opinion score (MOS) of 4.48, supporting the effectiveness of the proposed method for dysarthric speech reconstruction.
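
For readers unfamiliar with the metric, a relative word error rate (WER) reduction is conventionally computed as the drop in WER divided by the baseline WER. The sketch below illustrates that arithmetic only. The function name and the input values are hypothetical placeholders, not figures from the paper; the abstract reports only the resulting 31.7% and does not state the underlying absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WERs in percent. The values used below are
    hypothetical and are NOT taken from the EA-VAE paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical example: a baseline at 50% WER and a system at 40% WER.
    reduction = relative_wer_reduction(50.0, 40.0)
    print(f"{reduction:.1f}% relative WER reduction")
```

Note that a relative reduction depends on the baseline: the same absolute improvement of 10 points corresponds to a larger relative reduction when the baseline WER is lower.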
