MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Xinyue Yu; Youqing Fang; Pingyu Wu; Guoyang Ye; Wenbo Zhou; Weiming Zhang; Song Xiao

doi:10.1609/aaai.v40i21.38856

Authors

Xinyue Yu, University of Science and Technology of China
Youqing Fang, University of Science and Technology of China
Pingyu Wu, University of Science and Technology of China
Guoyang Ye, University of Science and Technology of China
Wenbo Zhou, University of Science and Technology of China
Weiming Zhang, University of Science and Technology of China
Song Xiao, Beijing Electronic Science and Technology Institute

DOI:

https://doi.org/10.1609/aaai.v40i21.38856

Abstract

Generating expressive and controllable human speech is a core goal of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose MF-Speech, a novel framework with two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. MF-SpeechGenerator then functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that on the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER = 4.67%), superior style control (SECS = 0.5685, Corr = 0.68), and the highest subjective evaluation scores (nMOS = 3.96, sMOS_t = 3.86, sMOS_e = 3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
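For readers unfamiliar with the objective metrics quoted in the abstract, the following is a minimal sketch of how such scores are commonly computed. It is not the authors' evaluation code. It assumes WER is the word-level edit distance divided by the reference length, and that SECS denotes the cosine similarity between two speaker embeddings (the abstract does not expand the acronym). The function names and the toy inputs are hypothetical; in practice the hypothesis text would come from a speech recognizer and the embeddings from a speaker encoder.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs for illustration only (hypothetical, not from the paper).
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(cosine_similarity([0.2, 0.5, 0.1], [0.1, 0.6, 0.0]))
```

Under these assumptions, a lower WER indicates more intelligible output, while a higher similarity indicates that the generated speech is closer to the target speaker's timbre.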
