RobuTrans: A Robust Transformer-Based Text-to-Speech Model

Naihan Li; Yanqing Liu; Yu Wu; Shujie Liu; Sheng Zhao; Ming Liu

doi:10.1609/aaai.v34i05.6337

Authors

Naihan Li University of Electronic Science and Technology of China
Yanqing Liu Microsoft
Yu Wu Microsoft
Shujie Liu Microsoft
Sheng Zhao Microsoft
Ming Liu University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v34i05.6337

Abstract

Recently, neural network based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize audio that is both natural and stable, in this paper we make a deep analysis of why previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. Besides, the position embedding is replaced with a 1-D CNN, since position embedding constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem, but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
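
The abstract does not spell out how the duration-based hard attention is computed. As a rough illustration only, and not the authors' implementation, one common way to realize such a mechanism is to let every output frame attend to exactly one input position, chosen by per-token durations. The sketch below assumes each input token comes with an integer duration in frames; the function names `hard_alignment` and `hard_attention_context` are hypothetical.

```python
from typing import List, Sequence


def hard_alignment(durations: Sequence[int]) -> List[int]:
    """Return, for each output frame, the index of the one input token it attends to.

    Assumption: durations[i] is the number of output frames assigned to token i.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    alignment: List[int] = []
    for position, n_frames in enumerate(durations):
        # Token `position` is repeated for as many frames as its duration.
        alignment.extend([position] * n_frames)
    return alignment


def hard_attention_context(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Build one context vector per output frame by copying the aligned encoder state."""
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    return [list(encoder_states[i]) for i in hard_alignment(durations)]


# Hypothetical example: three tokens with 2-dimensional encoder states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = hard_attention_context(states, durations=[2, 0, 3])
```

Because the alignment is fixed by the durations rather than learned as soft weights, no token with a positive duration can be skipped or repeated beyond its allotted frames, which is one plausible reading of why a hard attention of this kind would help robustness.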
