Speech Recognition Model Improves Text-to-Speech Synthesis Using Fine-Grained Reward
DOI:
https://doi.org/10.1609/aaai.v40i39.40631
Abstract
Recent advancements in Text-to-Speech (TTS) technology have been remarkable, enabling current models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, corresponding evaluation techniques appear to be lagging: existing Mean Opinion Score (MOS) estimation models typically perform regression-based scoring on entire speech segments, whereas a failed synthesis usually contains problematic elements in only a few isolated words rather than throughout the entire utterance. In this context, we present an intriguing finding: encoder-decoder ASR models, such as Whisper, leverage their extensive pre-training to precisely capture word-level mismatches between speech and text within their cross-attention mechanisms, thereby providing a fine-grained reward signal. Building upon this insight, we propose a novel TTS optimization method, which we term Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Instead of relying on any explicit reward annotations, W3AR leverages the attention information within a pre-trained ASR model, enabling finer-grained alignment and optimization of the sequences predicted by the TTS model. Experimental results demonstrate that W3AR not only effectively improves the TTS generation quality of existing models but also further enhances zero-shot robustness for both in-domain and out-of-domain prompt speakers. Additionally, our findings and proposed methodology offer new insight for generative tasks: understanding models can potentially serve as evaluators, providing highly fine-grained and valuable feedback for generative optimization.
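The core idea can be sketched in a few lines. Given a decoder-to-encoder cross-attention matrix from an ASR model (one row per text token, one column per audio frame), a well-synthesized word tends to attend sharply to a small span of frames, while a mispronounced or garbled word attends diffusely. The toy example below illustrates one plausible per-token scoring (peak attention mass); the paper's exact reward formulation is not reproduced here, and the attention values are fabricated for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_level_reward(cross_attn):
    """cross_attn: (num_text_tokens, num_audio_frames) attention probabilities.

    Returns one score per text token: the peak attention mass. A token whose
    attention concentrates on a narrow span of frames (a clean alignment)
    scores near 1; a token with diffuse attention (an uncertain or failed
    alignment) scores low. This is an illustrative proxy, not W3AR's exact
    reward definition.
    """
    return cross_attn.max(axis=-1)

# Toy cross-attention logits: 3 text tokens over 5 audio frames.
logits = np.array([
    [9.0, 0.0, 0.0, 0.0, 0.0],  # token 0: sharply aligned to frame 0
    [0.0, 4.0, 4.0, 0.0, 0.0],  # token 1: attention split across two frames
    [1.0, 1.0, 1.0, 1.0, 1.0],  # token 2: no clear alignment (uniform)
])
attn = softmax(logits, axis=-1)
rewards = token_level_reward(attn)
# rewards decrease as alignment gets more diffuse:
# rewards[0] > rewards[1] > rewards[2]
```

In practice, such attention maps could be read out of an encoder-decoder ASR model (e.g., Whisper's cross-attention weights during forced decoding of the target transcript) and used as a dense, word-level training signal for the TTS model, rather than a single utterance-level score.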
Published
2026-03-14
How to Cite
Wang, G., & Sun, P. (2026). Speech Recognition Model Improves Text-to-Speech Synthesis Using Fine-Grained Reward. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33440–33448. https://doi.org/10.1609/aaai.v40i39.40631
Section
AAAI Technical Track on Natural Language Processing IV