PGSS: Pitch-Guided Speech Separation


  • Xiang Li, Peking University
  • Yiwen Wang, Peking University
  • Yifan Sun, Peking University
  • Xihong Wu, Peking University
  • Jing Chen, Peking University



SNLP: Speech and Multimodality, SNLP: Applications, SNLP: Other Foundations of Speech & Natural Language Processing


Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA), this work proposes a novel pitch-guided speech separation framework. Its main advantage is that using pitch contours as the primary cue for selecting the target speaker avoids both the permutation problem and the unknown-speaker-number problem that affect general separation models. In addition, adversarial training is used in place of a traditional time-frequency mask to improve the perceptual quality of the separated speech. The framework consists of two phases: pitch extraction and speech separation. The first phase extracts pitch contour candidates for each speaker from the mixture, modeling the bottom-up process in ASA. Any of these contours can then be selected as the condition for the second phase, in which a conditional generative adversarial network (CGAN) separates the corresponding speaker; this phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus show that the proposed approach achieves higher pitch extraction accuracy and better separation performance than the baseline models, and that it has the potential to be applied to state-of-the-art (SOTA) architectures.
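To make the two-phase control flow concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the pitch extraction phase is replaced by hand-written oracle contours, and it swaps the CGAN generator for a simple per-frame harmonic projection (the function `separate` is hypothetical). All names and constants (`SR`, `FRAME`, `N_HARM`, `harmonic_voice`) are illustrative assumptions. The point it illustrates is only that each output is tied to the contour that conditioned it, so no output-to-speaker permutation has to be resolved and the number of outputs simply follows the number of contours.

```python
import math

SR = 8000     # sample rate in Hz (toy value, not from the paper)
FRAME = 400   # samples per analysis frame (toy value)
N_HARM = 3    # harmonics modeled per voice (toy value)


def harmonic_voice(f0, n_samples):
    """Synthesize a toy 'speaker': a sum of harmonics of a constant pitch f0."""
    return [
        sum(math.sin(2 * math.pi * f0 * h * t / SR) / h
            for h in range(1, N_HARM + 1))
        for t in range(n_samples)
    ]


def separate(mixture, contour):
    """Hypothetical stand-in for the pitch-conditioned separator.

    For every frame, project the mixture onto sines and cosines at the
    harmonics of that frame's pitch value and resynthesize the result.
    The paper uses a conditional GAN here; this projection is only a toy.
    """
    output = []
    for i, f0 in enumerate(contour):
        frame = mixture[i * FRAME:(i + 1) * FRAME]
        n = len(frame)
        estimate = [0.0] * n
        for h in range(1, N_HARM + 1):
            w = 2 * math.pi * f0 * h / SR
            sin_basis = [math.sin(w * t) for t in range(n)]
            cos_basis = [math.cos(w * t) for t in range(n)]
            # Projection coefficients onto the two basis signals.
            a = 2 * sum(x * s for x, s in zip(frame, sin_basis)) / n
            b = 2 * sum(x * c for x, c in zip(frame, cos_basis)) / n
            estimate = [e + a * s + b * c
                        for e, s, c in zip(estimate, sin_basis, cos_basis)]
        output.extend(estimate)
    return output


# Two toy speakers with constant pitch, mixed into one channel.
n_frames = 2
src_a = harmonic_voice(100.0, n_frames * FRAME)
src_b = harmonic_voice(160.0, n_frames * FRAME)
mixture = [x + y for x, y in zip(src_a, src_b)]

# Phase 1 (pitch extraction) is stubbed with oracle contours:
# one pitch value in Hz per frame, one contour per speaker.
contours = [[100.0] * n_frames, [160.0] * n_frames]

# Phase 2: one separation pass per selected contour.
estimates = [separate(mixture, contour) for contour in contours]
```

In this toy, the pitches are chosen so that every harmonic completes a whole number of cycles per frame, which makes the projection recover each source almost exactly. Real speech has time-varying pitch, unvoiced segments, and overlapping harmonics, which is where a learned conditional model would be needed.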




How to Cite

Li, X., Wang, Y., Sun, Y., Wu, X., & Chen, J. (2023). PGSS: Pitch-Guided Speech Separation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13130-13138.



AAAI Technical Track on Speech & Natural Language Processing