Zero-Shot Face-Based Voice Conversion: Bottleneck-Free Speech Disentanglement in the Real-World Scenario
Keywords: SNLP: Generation, SNLP: Speech and Multimodality, ML: Multimodal Learning
Abstract
Often a face has a voice: appearance can bear a strong relationship to how one sounds. In this work, we study how a face can be converted to a voice, i.e., face-based voice conversion. Since no clean dataset with paired faces and speech exists, face-based voice conversion suffers from difficult training and low synthesis quality caused by background noise and echo. Redundant information in the face-to-voice mapping further leads to synthesizing speech in a generic, speaker-unspecific style. Moreover, previous work disentangled speech by adjusting an information bottleneck, but the right bottleneck size is hard to determine. We therefore propose a bottleneck-free strategy for speech disentanglement. To avoid synthesizing generic-style speech, we utilize framewise facial embeddings. We apply adversarial learning with a multi-scale discriminator to improve synthesis quality, and add a self-attention module to focus on content-related features in in-the-wild data. Quantitative experiments show that our method outperforms previous work.
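The paper's own implementation is not reproduced here; as a rough, hedged illustration of the multi-scale discriminator idea mentioned in the abstract, the sketch below builds the progressively downsampled views of a waveform that separate discriminators would each score. The function names, pooling choice, and scale factors are assumptions for illustration, not the authors' actual architecture.

```python
import numpy as np

def avg_pool1d(x, k):
    """Downsample a 1-D signal by averaging non-overlapping windows of size k."""
    n = len(x) // k
    return x[: n * k].reshape(n, k).mean(axis=1)

def multi_scale_views(wave, scales=(1, 2, 4)):
    """Build the downsampled copies of a waveform that a multi-scale
    discriminator setup scores: scale 1 sees fine detail, larger scales
    see coarser temporal structure (scales are illustrative)."""
    return [avg_pool1d(wave, s) for s in scales]

wave = np.random.randn(16000)          # placeholder: 1 s of 16 kHz audio
views = multi_scale_views(wave)
print([len(v) for v in views])         # [16000, 8000, 4000]
```

Each view would then be fed to its own discriminator network, so adversarial feedback covers multiple temporal resolutions of the synthesized speech.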
How to Cite
Weng, S.-E., Shuai, H.-H., & Cheng, W.-H. (2023). Zero-Shot Face-Based Voice Conversion: Bottleneck-Free Speech Disentanglement in the Real-World Scenario. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13718-13726. https://doi.org/10.1609/aaai.v37i11.26607
AAAI Technical Track on Speech & Natural Language Processing