Rethinking the Dark Knowledge and Kullback-Leibler Divergence Loss in Knowledge Distillation Under Capacity Mismatching

Authors

  • Yingchao Wang, Beijing Institute of Technology
  • Wenqi Niu, Beijing Institute of Technology
  • Xingshan Yao, Beijing Institute of Technology
  • Li You, Beijing Institute of Technology
  • Weilun Fei, Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i31.39878

Abstract

Knowledge Distillation (KD) aims to transfer the dark knowledge that encodes inter-class similarity, semantic structure, and decision boundaries from a powerful teacher model to a compact student model by minimizing the Kullback-Leibler (KL) divergence between their output distributions. While effective, we demonstrate that KL-based KD is designed to match values precisely and does not explicitly constrain the relative relationships between classes. Meanwhile, we empirically find that vanilla KL-based KD suffers from gradient competition due to the zero-sum constraint in the softmax space, which may implicitly change the inter-class rank relationships learned by the student model, particularly under capacity mismatching. Therefore, we argue that the student model should learn not only the output values but also the relative ranking of classes. Accordingly, we propose a simple yet effective Relative Confidence Knowledge Distillation (RCKD) method that aligns the teacher’s and student’s relative confidence matrices via cosine similarity, achieving more efficient and robust distillation from a stronger teacher model. Extensive experiments demonstrate that RCKD consistently outperforms existing logit-based KD methods and exhibits strong adaptability across various teacher architectures and capacities.
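
As a rough illustration of the two objectives contrasted in the abstract, the sketch below computes a standard KL-based distillation term alongside a cosine-similarity alignment of pairwise "relative confidence" matrices. The abstract does not define the relative confidence matrix, so the pairwise-difference form used here (entry (i, j) equals p_i - p_j) is an assumption made purely for illustration, as are all function names and the epsilon value. It should not be read as the authors' RCKD formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def relative_confidence(probs):
    """ASSUMPTION: pairwise matrix with entry (i, j) = probs[i] - probs[j].

    The abstract does not specify this construction; it is a stand-in.
    """
    return [[pi - pj for pj in probs] for pi in probs]


def cosine_alignment_loss(teacher_probs, student_probs, eps=1e-12):
    """1 - cosine similarity between the flattened pairwise matrices.

    If either matrix is all zeros (a uniform distribution), the cosine
    term is 0 and the loss evaluates to 1.
    """
    t = [v for row in relative_confidence(teacher_probs) for v in row]
    s = [v for row in relative_confidence(student_probs) for v in row]
    dot = sum(a * b for a, b in zip(t, s))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_s = math.sqrt(sum(b * b for b in s))
    return 1.0 - dot / (norm_t * norm_s + eps)


# Hypothetical logits: the second student reverses the teacher's class ranking.
teacher = softmax([3.0, 1.0, 0.0])
student_same = softmax([3.0, 1.0, 0.0])
student_reversed = softmax([0.0, 1.0, 3.0])

kl_same = kl_divergence(teacher, student_same)
kl_reversed = kl_divergence(teacher, student_reversed)
rank_same = cosine_alignment_loss(teacher, student_same)
rank_reversed = cosine_alignment_loss(teacher, student_reversed)
```

With these toy inputs, both terms vanish when the student reproduces the teacher's distribution, and the cosine term exceeds 1 when the student reverses the teacher's ranking, since the two pairwise matrices then point in opposing directions. How the two terms would be weighted or combined in training is not stated in the abstract and is left out here.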

Published

2026-03-14

How to Cite

Wang, Y., Niu, W., Yao, X., You, L., & Fei, W. (2026). Rethinking the Dark Knowledge and Kullback-Leibler Divergence Loss in Knowledge Distillation Under Capacity Mismatching. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26688–26696. https://doi.org/10.1609/aaai.v40i31.39878

Section

AAAI Technical Track on Machine Learning VIII