KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

Authors

  • Kangxiang Xia, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an
  • Xinfa Zhu, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an
  • Jixun Yao, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an
  • Wenjie Tian, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an
  • Wenhao Li, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an
  • Lei Xie, Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi'an

DOI:

https://doi.org/10.1609/aaai.v40i40.40695

Abstract

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback–Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous speech representations in TTS.
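The training objective described above can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: it assumes the Flow-VAE posterior and the AR Transformer's prediction head both parameterize diagonal Gaussians over each latent speech frame (a common choice, but not confirmed by the abstract), and computes the closed-form Kullback–Leibler divergence between them as the per-frame loss. The KL direction and the random "encoder"/"predictor" outputs are placeholders.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over the feature dimension."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Toy shapes: T latent speech frames, each a D-dimensional vector.
T, D = 4, 8
rng = np.random.default_rng(0)

# Stand-in for the Flow-VAE posterior over each frame (the "target" distribution).
mu_q, logvar_q = rng.normal(size=(T, D)), np.zeros((T, D))
# Stand-in for the AR Transformer's predicted distribution for each frame.
mu_p, logvar_p = rng.normal(size=(T, D)), np.zeros((T, D))

# Average KL over frames plays the role of the training loss.
loss = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()
print(loss >= 0.0)  # KL divergence is always non-negative
```

At inference time, the analogous step would be sampling a frame from the predicted distribution and feeding it back autoregressively; the abstract does not specify those details, so they are omitted here.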

Published

2026-03-14

How to Cite

Xia, K., Zhu, X., Yao, J., Tian, W., Li, W., & Xie, L. (2026). KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34016–34024. https://doi.org/10.1609/aaai.v40i40.40695

Section

AAAI Technical Track on Natural Language Processing V