Rethinking the Dark Knowledge and Kullback-Leibler Divergence Loss in Knowledge Distillation Under Capacity Mismatching

Yingchao Wang; Wenqi Niu; Xingshan Yao; Li You; Weilun Fei

doi:10.1609/aaai.v40i31.39878

Authors

Yingchao Wang, Beijing Institute of Technology
Wenqi Niu, Beijing Institute of Technology
Xingshan Yao, Beijing Institute of Technology
Li You, Beijing Institute of Technology
Weilun Fei, Beijing Institute of Technology

Abstract

Knowledge Distillation (KD) aims to transfer dark knowledge, which encodes inter-class similarity, semantic structure, and decision boundaries, from a powerful teacher model to a compact student model by minimizing the Kullback-Leibler (KL) divergence between their output distributions. Although KL-based KD is effective, we demonstrate that it is designed to match output values precisely and does not explicitly constrain the relative relationships between classes. We also find empirically that vanilla KL-based KD suffers from gradient competition due to the zero-sum constraint in the softmax space, which may implicitly change the inter-class rank relationships learned by the student model, particularly under capacity mismatching. We therefore argue that the student model should learn not only the output values but also the relative ranking of classes. Accordingly, we propose a simple yet effective Relative Confidence Knowledge Distillation (RCKD) method that aligns the teacher’s and student’s relative confidence matrices via cosine similarity, achieving more efficient and robust distillation from a stronger teacher model. Extensive experiments demonstrate that RCKD consistently outperforms existing logit-based KD methods and exhibits strong adaptability across various teacher architectures and capacities.
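
To make the two objectives named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. The KL term follows the commonly used temperature-scaled distillation loss. The abstract does not define the "relative confidence matrix", so the pairwise-difference construction in `relative_matrix` (entry (i, j) equal to p_i - p_j) is purely an assumption for illustration, as are all function names, the temperature value, and the example logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Temperature-scaled KL(p_teacher || p_student) for a single sample."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def relative_matrix(probs):
    """ASSUMPTION (not from the paper): entry (i, j) is p_i - p_j."""
    return [[pi - pj for pj in probs] for pi in probs]


def cosine_alignment_loss(teacher_logits, student_logits, temperature=4.0):
    """1 - cosine similarity between the flattened (assumed) relative matrices."""
    t = [v for row in relative_matrix(softmax(teacher_logits, temperature)) for v in row]
    s = [v for row in relative_matrix(softmax(student_logits, temperature)) for v in row]
    dot = sum(a * b for a, b in zip(t, s))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in s))
    # Guard against a zero norm, which occurs when a distribution is uniform.
    return 1.0 - dot / max(norm, 1e-12)


if __name__ == "__main__":
    teacher = [5.0, 2.0, 0.5]          # hypothetical teacher logits
    same_rank = [2.5, 1.0, 0.25]       # student with the same class ordering
    swapped = [2.0, 5.0, 0.5]          # student with the top two classes swapped
    for name, student in [("same_rank", same_rank), ("swapped", swapped)]:
        print(name,
              "KL:", kl_kd_loss(teacher, student),
              "cosine loss:", cosine_alignment_loss(teacher, student))
```

Under this assumed construction, a student that preserves the teacher's class ordering but with different magnitudes incurs a small cosine loss, while a student that swaps the top two classes incurs a much larger one. That is the kind of rank sensitivity the abstract argues a value-matching KL term does not explicitly enforce.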
