Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Authors

  • Zhaoxi Mu, Xi'an Jiaotong University
  • Xinyu Yang, Xi'an Jiaotong University
  • Sining Sun, Du Xiaoman
  • Qing Yang, Du Xiaoman

DOI:

https://doi.org/10.1609/aaai.v38i17.29846

Keywords:

NLP: Speech, ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning

Abstract

Speech signals are inherently complex because they carry both global acoustic characteristics and local semantic information. In target speech extraction, however, some of the global and local semantic information in the reference speech is irrelevant to speaker identity and can cause speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach works in two phases, using a reference speech encoding network and a global information disentanglement network to progressively separate speaker identity information from the other, irrelevant factors. Only the disentangled speaker identity information is used to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer, which ensures that the acoustic representation of the mixed signal is not disturbed by the speaker embeddings. This component incorporates the speaker embeddings as conditional information, providing natural and efficient guidance for the speech extraction network. Experimental results demonstrate the effectiveness of the approach, showing a substantial reduction in the likelihood of speaker confusion.
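The abstract does not give the internals of the adaptive modulation Transformer. As a rough illustration of what "incorporating speaker embeddings as conditional information" can look like, the sketch below uses a generic conditional scale-and-shift (FiLM-style) modulation of normalized features. This is an assumption chosen for illustration, not the authors' architecture; the names `adaptive_modulation`, `project`, `w_scale`, and `w_shift` are hypothetical.

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one feature frame to zero mean and unit variance."""
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    return [(x - mean) / math.sqrt(var + eps) for x in frame]


def project(embedding, weights):
    """Linear map of an E-dim embedding through an E x D weight matrix."""
    dim = len(weights[0])
    return [sum(e * row[d] for e, row in zip(embedding, weights))
            for d in range(dim)]


def adaptive_modulation(frames, speaker_embedding, w_scale, w_shift):
    """Condition mixture features on a speaker embedding (hypothetical sketch).

    The mixture frames are only rescaled and shifted per channel; the speaker
    embedding is never added to or concatenated with them directly.
    """
    scale = project(speaker_embedding, w_scale)  # per-channel gain offset
    shift = project(speaker_embedding, w_shift)  # per-channel bias
    out = []
    for frame in frames:
        normed = layer_norm(frame)
        out.append([n * (1.0 + s) + b
                    for n, s, b in zip(normed, scale, shift)])
    return out


# Toy usage: 2 frames with 3 channels, a 2-dim speaker embedding.
frames = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
speaker_embedding = [1.0, 0.0]
zero_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
modulated = adaptive_modulation(frames, speaker_embedding,
                                zero_weights, zero_weights)
```

With all-zero projection weights the modulation reduces to plain layer normalization, which makes it easy to check that the conditioning path only adjusts the mixture features rather than replacing them.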

Published

2024-03-24

How to Cite

Mu, Z., Yang, X., Sun, S., & Yang, Q. (2024). Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 18815-18823. https://doi.org/10.1609/aaai.v38i17.29846

Issue

Vol. 38 No. 17 (2024)

Section

AAAI Technical Track on Natural Language Processing II