Cross-Modal Distillation for Speaker Recognition

Yufeng Jin; Guosheng Hu; Haonan Chen; Duoqian Miao; Liang Hu; Cairong Zhao

doi:10.1609/aaai.v37i11.26525

Authors

Yufeng Jin, Tongji University
Guosheng Hu, Oosto
Haonan Chen, Alibaba Group
Duoqian Miao, Tongji University
Liang Hu, Tongji University
Cairong Zhao, Tongji University

DOI:

https://doi.org/10.1609/aaai.v37i11.26525

Keywords:

SNLP: Speech and Multimodality, CV: Biometrics, Face, Gesture & Pose, CV: Multi-modal Vision, APP: Biometrics, ML: Multimodal Learning, ML: Representation Learning

Abstract

Speaker recognition has achieved great progress recently; however, it is neither easy nor efficient to further improve its performance via the traditional solutions of collecting more data and designing new neural networks. To address the fundamental challenge of speech data, i.e., low information density, multimodal learning can introduce richer and more discriminative information as input for identity recognition. Specifically, since the face image is more discriminative than speech for identity recognition, we conduct multimodal learning by introducing a face recognition model (teacher) to transfer discriminative knowledge to a speaker recognition model (student) during training. However, this knowledge transfer via distillation is not trivial because the large domain gap between face and speech can easily lead to overfitting. In this work, we introduce a multimodal learning framework, VGSR (Vision-Guided Speaker Recognition). Specifically, we propose an MKD (Margin-based Knowledge Distillation) strategy for cross-modality distillation that introduces a loose constraint to align the teacher and student, greatly reducing overfitting. Our MKD strategy can easily adapt to various existing knowledge distillation methods. In addition, we propose a QAW (Quality-based Adaptive Weights) module to weight input samples via quantified data quality, leading to robust model training. Experimental results on the VoxCeleb1 and CN-Celeb datasets show that our proposed strategies can effectively improve the accuracy of speaker recognition by a margin of 10% to 15%, and that our methods are very robust to different types of noise.
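
The abstract does not give the exact form of the MKD loss or the QAW weights, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a hinge on the cosine distance between teacher (face) and student (speech) embeddings, so that a pair already within a margin contributes no penalty, and it assumes per-sample quality weights are supplied by the caller. The function names, the default margin of 0.2, the use of cosine distance, and the assumption that both embeddings share the same dimension are all hypothetical choices made for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def margin_distillation_loss(teacher, student, weights, margin=0.2):
    """Weighted hinge loss over teacher/student embedding pairs.

    Assumption: the "loose constraint" is modeled as a hinge, so a pair
    is penalized only by the amount its distance exceeds `margin`.
    Assumption: `weights` stands in for per-sample quality scores.
    """
    total = 0.0
    for t, s, w in zip(teacher, student, weights):
        # No penalty once the student embedding is within the margin.
        total += w * max(0.0, cosine_distance(t, s) - margin)
    # Normalize by the total weight so the loss is a weighted average.
    return total / sum(weights)
```

In this sketch, raising a sample's weight increases its influence on the averaged loss, while pairs that are already close enough are ignored entirely, which is one way a looser alignment target could reduce pressure to fit the teacher exactly.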
