Cross-Modal Distillation for Speaker Recognition


  • Yufeng Jin Tongji University
  • Guosheng Hu Oosto
  • Haonan Chen Alibaba Group
  • Duoqian Miao Tongji University
  • Liang Hu Tongji University
  • Cairong Zhao Tongji University



SNLP: Speech and Multimodality, CV: Biometrics, Face, Gesture & Pose, CV: Multi-modal Vision, APP: Biometrics, ML: Multimodal Learning, ML: Representation Learning


Speaker recognition has made great progress recently, but it is neither easy nor efficient to improve its performance further via the traditional solutions of collecting more data and designing new neural networks. A fundamental challenge of speech data is its low information density. Multimodal learning can mitigate this challenge by introducing richer and more discriminative information as input for identity recognition. Specifically, since face images are more discriminative than speech for identity recognition, we conduct multimodal learning by introducing a face recognition model (teacher) that transfers discriminative knowledge to a speaker recognition model (student) during training. However, this knowledge transfer via distillation is not trivial, because the large domain gap between face and speech can easily lead to overfitting. In this work, we introduce a multimodal learning framework, VGSR (Vision-Guided Speaker Recognition). We propose an MKD (Margin-based Knowledge Distillation) strategy for cross-modality distillation, which introduces a loose constraint to align the teacher and student and greatly reduces overfitting. Our MKD strategy can be easily adapted to various existing knowledge distillation methods. In addition, we propose a QAW (Quality-based Adaptive Weights) module that weights input samples by their quantified data quality, leading to more robust model training. Experimental results on the VoxCeleb1 and CN-Celeb datasets show that our proposed strategies effectively improve the accuracy of speaker recognition by a margin of 10% to 15%, and that our methods are robust to various types of noise.
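
The idea of a loose alignment constraint combined with per-sample quality weights can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the cosine-distance hinge, the margin value of 0.2, and the function names cosine_distance and margin_distillation_loss are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def margin_distillation_loss(student, teacher, margin=0.2, weights=None):
    """Hypothetical margin-based distillation loss (illustrative only).

    A student (speech) embedding is penalised only when its distance to the
    paired teacher (face) embedding exceeds `margin`, so the student is not
    forced to match the teacher exactly. `weights` stands in for per-sample
    quality scores; how such scores are computed is not specified here.
    """
    if weights is None:
        weights = [1.0] * len(student)
    total_weight = sum(weights)
    loss = 0.0
    for s, t, w in zip(student, teacher, weights):
        # Hinge: zero penalty inside the margin, linear penalty outside it.
        loss += w * max(0.0, cosine_distance(s, t) - margin)
    return loss / total_weight


# Toy usage: the first pair is aligned, the second is orthogonal.
student_embeddings = [[1.0, 0.0], [1.0, 0.0]]
teacher_embeddings = [[2.0, 0.0], [0.0, 3.0]]
uniform = margin_distillation_loss(student_embeddings, teacher_embeddings)
weighted = margin_distillation_loss(
    student_embeddings, teacher_embeddings, weights=[1.0, 3.0]
)
```

In this toy setup the aligned pair contributes no loss because it falls inside the margin, while the orthogonal pair is penalised in proportion to its weight. A strict distillation loss would correspond to a margin of 0.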




How to Cite

Jin, Y., Hu, G., Chen, H., Miao, D., Hu, L., & Zhao, C. (2023). Cross-Modal Distillation for Speaker Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12977-12985.



AAAI Technical Track on Speech & Natural Language Processing