Speech Recognition Model Improves Text-to-Speech Synthesis Using Fine-Grained Reward
DOI:
https://doi.org/10.1609/aaai.v40i39.40631
Abstract
Recent advancements in Text-to-Speech (TTS) technology have been remarkable, enabling current models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, corresponding evaluation techniques appear to be lagging: existing Mean Opinion Score (MOS) estimation models typically perform regression-based scoring on entire speech segments, whereas a failed synthesis usually contains problematic elements in only a few isolated words rather than throughout the entire utterance. In this context, we present an intriguing finding: encoder-decoder ASR models, such as Whisper, leverage their extensive pre-training to precisely capture word-level mismatches between speech and text within their cross-attention mechanisms, thereby providing a fine-grained reward signal. Building upon this insight, we propose a novel TTS optimization method, which we term Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Instead of relying on any explicit reward annotations, W3AR leverages the attention information within a pre-trained ASR model, enabling finer-grained alignment and optimization of the sequences predicted by the TTS model. Experimental results demonstrate that W3AR not only effectively improves the TTS generation quality of existing models but also further enhances zero-shot robustness for both in-domain and out-of-domain prompt speakers. Additionally, our findings and proposed methodology offer new insight for generative tasks: understanding models can potentially serve as evaluators, providing highly fine-grained and valuable feedback for generative optimization.
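The core idea can be sketched in a few lines. Given a decoder-to-encoder cross-attention matrix from an ASR model (one row per text token, one column per audio frame), a well-synthesized word tends to attend sharply to a small span of frames, while a mispronounced or garbled word attends diffusely. The toy example below illustrates one plausible per-token scoring (peak attention mass); the paper's exact reward formulation is not reproduced here, and the attention values are fabricated for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_level_reward(cross_attn):
    """cross_attn: (num_text_tokens, num_audio_frames) attention probabilities.

    Returns one score per text token: the peak attention mass. A token whose
    attention concentrates on a narrow span of frames (a clean alignment)
    scores near 1; a token with diffuse attention (an uncertain or failed
    alignment) scores low. This is an illustrative proxy, not W3AR's exact
    reward definition.
    """
    return cross_attn.max(axis=-1)

# Toy cross-attention logits: 3 text tokens over 5 audio frames.
logits = np.array([
    [9.0, 0.0, 0.0, 0.0, 0.0],  # token 0: sharply aligned to frame 0
    [0.0, 4.0, 4.0, 0.0, 0.0],  # token 1: attention split across two frames
    [1.0, 1.0, 1.0, 1.0, 1.0],  # token 2: no clear alignment (uniform)
])
attn = softmax(logits, axis=-1)
rewards = token_level_reward(attn)
# rewards decrease as alignment gets more diffuse:
# rewards[0] > rewards[1] > rewards[2]
```

In practice, such attention maps could be read out of an encoder-decoder ASR model (e.g., Whisper's cross-attention weights during forced decoding of the target transcript) and used as a dense, word-level training signal for the TTS model, rather than a single utterance-level score.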
Published
2026-03-14
How to Cite
Wang, G., & Sun, P. (2026). Speech Recognition Model Improves Text-to-Speech Synthesis Using Fine-Grained Reward. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33440–33448. https://doi.org/10.1609/aaai.v40i39.40631
Section
AAAI Technical Track on Natural Language Processing IV