MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Authors

  • Xinyue Yu University of Science and Technology of China
  • Youqing Fang University of Science and Technology of China
  • Pingyu Wu University of Science and Technology of China
  • Guoyang Ye University of Science and Technology of China
  • Wenbo Zhou University of Science and Technology of China
  • Weiming Zhang University of Science and Technology of China
  • Song Xiao Beijing Electronic Science and Technology Institute

DOI:

https://doi.org/10.1609/aaai.v40i21.38856

Abstract

Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose MF-Speech, a novel framework that consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. MF-SpeechGenerator then functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that, on the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_t=3.86, sMOS_e=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
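
For readers unfamiliar with two of the objective metrics quoted in the abstract, the following minimal sketch shows how a word error rate and a cosine-similarity score between embedding vectors (the usual basis of SECS) are commonly computed. It is not the authors' evaluation code: the function names and inputs are illustrative assumptions, and a real evaluation would obtain the transcripts from a speech recognizer and the vectors from a speaker-embedding model.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

Under this convention a lower WER indicates more intelligible output, while a cosine similarity closer to 1 indicates that the generated speech lies closer to the reference speaker in embedding space.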

Published

2026-03-14

How to Cite

Yu, X., Fang, Y., Wu, P., Ye, G., Zhou, W., Zhang, W., & Xiao, S. (2026). MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(21), 17966–17974. https://doi.org/10.1609/aaai.v40i21.38856

Section

AAAI Technical Track on Humans and AI