TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang; Korin Richmond; Zhen-Hua Ling; Lirong Dai

doi:10.1609/aaai.v35i16.17693

Authors

Jing-Xuan Zhang, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China; The Center for Speech Technology Research, University of Edinburgh, UK
Korin Richmond, The Center for Speech Technology Research, University of Edinburgh, UK
Zhen-Hua Ling, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China
Lirong Dai, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, P. R. China

DOI:

https://doi.org/10.1609/aaai.v35i16.17693

Keywords:

Speech Synthesis

Abstract

This paper presents TaLNet, a model for voice reconstruction that takes ultrasound tongue videos and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders process the tongue and lip data streams, and the decoder predicts acoustic features conditioned on the encoder outputs and speaker codes. To compensate for the relatively small amount of dual articulatory-acoustic data available for training, and since our task shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy that exploits the much larger amounts of acoustic-only data available for training TTS models. A Tacotron 2 TTS model is first trained, and the parameters of its decoder are then transferred to the TaLNet decoder. We evaluated our approach on an unconstrained multi-speaker voice recovery task. Our results show the effectiveness of both the proposed model and the transfer learning strategy. Speech reconstructed using our proposed method significantly outperformed all baselines (DNN, BLSTM and TaLNet without transfer learning) in terms of both naturalness and intelligibility. When an ASR model is used to decode the reconstructed speech, our proposed method achieves a relative WER reduction of over 30% compared to the baselines.
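The decoder transfer described in the abstract can be illustrated with a minimal sketch. The abstract states only that the Tacotron 2 decoder parameters are transferred to the TaLNet decoder, so everything else here is an assumption made for illustration: the parameter names, the `decoder.` prefix, the size check, and the use of plain dictionaries as stand-ins for model weights. It is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's weights: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def transfer_decoder(tts_params: Params, talnet_params: Params,
                     prefix: str = "decoder.") -> List[str]:
    """Copy decoder parameters from a trained TTS model into TaLNet.

    Only parameters whose name starts with `prefix`, and which exist in the
    target with the same size, are copied. Encoder parameters are untouched.
    Returns the names of the parameters that were copied.
    """
    copied = []
    for name, value in tts_params.items():
        if not name.startswith(prefix):
            continue  # skip the TTS text encoder; TaLNet has its own encoders
        target = talnet_params.get(name)
        if target is None or len(target) != len(value):
            continue  # skip parameters with no size-compatible counterpart
        talnet_params[name] = list(value)  # copy, so the two models stay independent
        copied.append(name)
    return copied


# Toy weights (all names and values are assumed, for illustration only).
tts = {
    "text_encoder.weight": [0.1, 0.2],
    "decoder.lstm.weight": [0.5, -0.3, 0.8],
    "decoder.proj.weight": [1.0, 2.0],
}
talnet = {
    "tongue_encoder.weight": [0.0, 0.0],
    "lip_encoder.weight": [0.0, 0.0],
    "decoder.lstm.weight": [0.0, 0.0, 0.0],
    "decoder.proj.weight": [0.0, 0.0],
}

copied = transfer_decoder(tts, talnet)
print(copied)
```

In this sketch the tongue and lip encoders keep their initial values and would still need to be trained on the articulatory-acoustic data; only the decoder starts from the TTS weights. Whether the transferred decoder is then fine-tuned or frozen is not stated in the abstract.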
