PGSS: Pitch-Guided Speech Separation

Xiang Li; Yiwen Wang; Yifan Sun; Xihong Wu; Jing Chen

doi:10.1609/aaai.v37i11.26542

Authors

Xiang Li, Peking University
Yiwen Wang, Peking University
Yifan Sun, Peking University
Xihong Wu, Peking University
Jing Chen, Peking University

DOI:

https://doi.org/10.1609/aaai.v37i11.26542

Keywords:

SNLP: Speech and Multimodality, SNLP: Applications, SNLP: Other Foundations of Speech & Natural Language Processing

Abstract

Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA), this work proposes a novel pitch-guided speech separation framework. Its main advantage is that using pitch contours as the primary cue for selecting the target speaker avoids both the permutation problem and the unknown-speaker-number problem found in general models. In addition, adversarial training is applied, instead of a traditional time-frequency mask, to improve the perceptual quality of the separated speech. Specifically, the framework is divided into two phases: pitch extraction and speech separation. The first phase extracts pitch contour candidates for each speaker from the mixture, modeling the bottom-up process in ASA. In the second phase, any of these pitch contours can be selected as the condition for a conditional generative adversarial network (CGAN) that separates the corresponding speaker; this phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus show that the proposed approaches achieve higher pitch extraction accuracy and better separation performance than the baseline models, and have the potential to be applied to state-of-the-art (SOTA) architectures.
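
The two-phase data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' model: the pitch contour candidates are assumed to be given (standing in for phase one), and the hypothetical `separate` function uses a simple harmonic selection rule in place of the CGAN generator. All names and the sparse "spectrogram" representation are assumptions made for illustration. The sketch only shows why conditioning on a pitch contour fixes the output order and lets the number of outputs follow the number of contours.

```python
from typing import Dict, List

# Toy sparse spectrum for one frame: frequency in Hz -> magnitude.
Frame = Dict[int, float]


def harmonic_frame(f0: int, n_harmonics: int = 4) -> Frame:
    """Toy voiced frame with unit energy at the first few harmonics of f0."""
    return {f0 * k: 1.0 for k in range(1, n_harmonics + 1)}


def mix(a: List[Frame], b: List[Frame]) -> List[Frame]:
    """Sum two toy spectrograms frame by frame."""
    mixed = []
    for frame_a, frame_b in zip(a, b):
        frame = dict(frame_a)
        for freq, mag in frame_b.items():
            frame[freq] = frame.get(freq, 0.0) + mag
        mixed.append(frame)
    return mixed


def separate(mixture: List[Frame], pitch_contour: List[int]) -> List[Frame]:
    """Hypothetical stand-in for the conditional generator.

    Keeps only the bins that lie on harmonics of the conditioning pitch.
    The paper uses a CGAN here, not a harmonic selection rule.
    """
    return [
        {freq: mag for freq, mag in frame.items() if freq % f0 == 0}
        for frame, f0 in zip(mixture, pitch_contour)
    ]


# Assumed output of phase one: one pitch contour (Hz per frame) per speaker.
candidates = [[100, 110], [130, 150]]

# Build the toy sources and their single-channel mixture.
speaker_a = [harmonic_frame(f0) for f0 in candidates[0]]
speaker_b = [harmonic_frame(f0) for f0 in candidates[1]]
mixture = mix(speaker_a, speaker_b)

# Phase two: one separation pass per selected contour. The output order is
# set by the chosen contour, so no permutation search is needed.
outputs = [separate(mixture, contour) for contour in candidates]
```

In this toy setting the harmonics of the two speakers never coincide, so each output recovers its source exactly; real mixtures have overlapping harmonics and unvoiced segments, which this sketch does not address.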
