EA-VAE: Learning to Reconstruct Dysarthric Speech via Variational Autoencoder with Encoding Alignment

Authors

  • Daipeng Zhang, School of New Media and Communication, Tianjin University, Tianjin, China
  • Wenhuan Lu, School of New Media and Communication, Tianjin University, Tianjin, China; College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Intelligence Science and Engineering, Qinghai Minzu University, Xining, China
  • Xianghu Yue, College of Intelligence and Computing, Tianjin University, Tianjin, China
  • Hongcheng Zhang, College of Intelligence and Computing, Tianjin University, Tianjin, China
  • Jianguo Wei, School of New Media and Communication, Tianjin University, Tianjin, China; College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Intelligence Science and Engineering, Qinghai Minzu University, Xining, China

DOI:

https://doi.org/10.1609/aaai.v40i41.40766

Abstract

Dysarthric speech reconstruction (DSR) aims to improve the intelligibility of dysarthric speech. Compared with normal speech, dysarthric speech exhibits pathological features, including discontinuous pronunciation, slow speech, hoarseness, and improper pauses. The resulting disparities between the feature spaces of normal and dysarthric speech can lead to suboptimal reconstruction and thus degraded intelligibility. To improve reconstruction across these feature spaces, this paper proposes a DSR model named the Encoding-Aligned Variational Autoencoder (EA-VAE). By incorporating alignment modules for frame-level embedding features, prior distributions, and duration into the encoder of the VAE, the model explicitly aligns the dysarthric speech encoding with a representation of the parallel normal speech. A shared decoder then generates speech with improved intelligibility. Experimental results on the UASpeech benchmark show that EA-VAE achieves state-of-the-art performance, with a 31.7% relative word error rate (WER) reduction and the highest subjective mean opinion score (MOS) of 4.48, supporting the effectiveness of the proposed method for dysarthric speech reconstruction.
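
As a reading aid for the reported metric, the sketch below shows how a relative word error rate reduction is conventionally computed: the drop in word error rate divided by the baseline word error rate. The function name `relative_wer_reduction` and both input values are hypothetical and chosen only for illustration. The abstract does not state the underlying absolute error rates or the baseline system, so none of these numbers come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical WER values in percent, for illustration only (not from the paper).
baseline = 60.0
system = 41.0

reduction = relative_wer_reduction(baseline, system)
print(f"{reduction:.1%}")  # prints "31.7%"
```

With these assumed values the absolute reduction is 19 percentage points, while the relative reduction is 19/60, or about 31.7%. The distinction matters when comparing systems evaluated against different baselines.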

Published

2026-03-14

How to Cite

Zhang, D., Lu, W., Yue, X., Zhang, H., & Wei, J. (2026). EA-VAE: Learning to Reconstruct Dysarthric Speech via Variational Autoencoder with Encoding Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34656–34664. https://doi.org/10.1609/aaai.v40i41.40766

Section

AAAI Technical Track on Natural Language Processing VI