TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Authors

  • Jing-Xuan Zhang, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China; The Center for Speech Technology Research, University of Edinburgh, UK
  • Korin Richmond, The Center for Speech Technology Research, University of Edinburgh, UK
  • Zhen-Hua Ling, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China
  • Lirong Dai, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China

DOI:

https://doi.org/10.1609/aaai.v35i16.17693

Keywords:

Speech Synthesis

Abstract

This paper presents TaLNet, a model for voice reconstruction that takes ultrasound tongue videos and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders process the tongue and lip data streams, and the decoder predicts acoustic features conditioned on the encoder outputs and speaker codes. Only relatively small amounts of paired articulatory-acoustic data are available for training, and our task shares with text-to-speech (TTS) the common goal of speech generation. We therefore propose a novel transfer learning strategy that exploits the much larger amounts of acoustic-only data available for training TTS models: a Tacotron 2 TTS model is trained first, and the parameters of its decoder are then transferred to the TaLNet decoder. We evaluated our approach on an unconstrained multi-speaker voice recovery task. The results show the effectiveness of both the proposed model and the transfer learning strategy. Speech reconstructed with the proposed method significantly outperformed all baselines (DNN, BLSTM, and the model without transfer learning) in terms of both naturalness and intelligibility. When the reconstructed speech is decoded with an ASR model, the proposed method achieves a relative WER reduction of over 30% compared to the baselines.
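
The decoder transfer described in the abstract can be illustrated with a minimal sketch. It assumes model parameters are stored as name-to-values mappings, similar to the state dictionaries used by common deep learning frameworks. The parameter names, the `decoder.` prefix, the `transfer_decoder` helper, and the toy values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def transfer_decoder(tts_params: Params, talnet_params: Params,
                     prefix: str = "decoder.") -> Params:
    """Return a copy of talnet_params whose decoder entries are
    overwritten by the matching entries of the pre-trained TTS model."""
    merged = {name: list(values) for name, values in talnet_params.items()}
    for name, values in tts_params.items():
        # Copy only decoder parameters that exist in both models
        # and have the same size; everything else is left untouched.
        if (name.startswith(prefix) and name in merged
                and len(values) == len(merged[name])):
            merged[name] = list(values)
    return merged


# Toy stand-in for a pre-trained TTS model (assumed names and values).
tts_params = {
    "text_encoder.weight": [0.1, 0.2],
    "decoder.lstm.weight": [1.0, 2.0, 3.0],
    "decoder.proj.weight": [4.0, 5.0],
}

# Toy stand-in for a freshly initialised articulatory model.
talnet_params = {
    "tongue_encoder.weight": [0.0, 0.0],
    "lip_encoder.weight": [0.0, 0.0],
    "decoder.lstm.weight": [0.0, 0.0, 0.0],
    "decoder.proj.weight": [0.0, 0.0],
}

initialised = transfer_decoder(tts_params, talnet_params)
```

In this sketch the tongue and lip encoders keep their own initialisation, the text encoder of the TTS model is discarded, and only the decoder starts from the TTS weights before training on the articulatory-acoustic data.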

Published

2021-05-18

How to Cite

Zhang, J.-X., Richmond, K., Ling, Z.-H., & Dai, L. (2021). TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 14402-14410. https://doi.org/10.1609/aaai.v35i16.17693

Section

AAAI Technical Track on Speech and Natural Language Processing III