LLM-Oriented Token-Adaptive Knowledge Distillation

Authors

  • Xurong Xie, Zhejiang University
  • Zhucun Xue, Zhejiang University
  • Jiafu Wu, Tencent Youtu Lab
  • Jian Li, Tencent Youtu Lab
  • Yabiao Wang, Tencent Youtu Lab
  • Xiaobin Hu, National University of Singapore
  • Yong Liu, Zhejiang University
  • Jiangning Zhang, Zhejiang University & Tencent Youtu Lab

DOI:

https://doi.org/10.1609/aaai.v40i40.40701

Abstract

Knowledge Distillation (KD) is a key technique for compressing Large Language Models (LLMs), but prevailing logit-based methods employ static strategies misaligned with the student’s dynamic learning process. By treating all tokens indiscriminately with a fixed temperature, these methods result in suboptimal knowledge transfer. To address this, we propose LLM-oriented token-Adaptive Knowledge Distillation (AdaKD), a framework that adapts the distillation process to each token’s real-time learning state. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. First, the Loss-driven Adaptive Token Focusing (LATF) module dynamically concentrates distillation on valuable tokens by monitoring the student’s learning stability. Second, Inverse Difficulty Temperature Scaling (IDTS) applies a counterintuitive token-level temperature schedule: low for difficult tokens to target error correction, and high for easy tokens to learn the teacher’s smooth output distribution for better generalization. As a plug-and-play framework, AdaKD consistently improves performance across diverse distillation methods, model architectures, and benchmarks.
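The IDTS idea from the abstract can be made concrete with a minimal sketch. Everything below is an illustration, not the paper's implementation: the difficulty metric (here, the student's per-token loss), the temperature range `t_min`/`t_max`, the min-max normalization, and the function names are all assumptions for this example; the paper's unified difficulty metric and exact schedule may differ.

```python
import numpy as np

def idts_temperatures(token_losses, t_min=0.5, t_max=2.0):
    """Sketch of Inverse Difficulty Temperature Scaling: map per-token
    difficulty (assumed here to be the student's per-token loss) to a
    temperature. Hard tokens -> low T (sharp teacher targets for error
    correction); easy tokens -> high T (smooth teacher distribution)."""
    losses = np.asarray(token_losses, dtype=float)
    # Min-max normalize difficulty to [0, 1] (an assumption of this sketch).
    d = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    # Inverse mapping: difficulty 0 -> t_max, difficulty 1 -> t_min.
    return t_max - d * (t_max - t_min)

def softmax(logits, T):
    """Temperature-scaled softmax with a max-subtraction for stability."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def idts_kd_loss(student_logits, teacher_logits, token_losses):
    """Per-token temperature-scaled KL(teacher || student), averaged over
    the sequence, with the standard T^2 gradient-scale correction."""
    T = idts_temperatures(token_losses)[:, None]            # [seq, 1]
    p_t = softmax(np.asarray(teacher_logits, float), T)     # [seq, vocab]
    p_s = softmax(np.asarray(student_logits, float), T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1)
    return float((kl * T[:, 0] ** 2).mean())
```

A token whose loss is currently highest receives `t_min` and is thus pushed toward the teacher's sharpest (most corrective) target, while the easiest token receives `t_max` and learns the teacher's softened distribution; LATF's stability-driven token selection is left out here because the abstract does not specify its criterion.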

Published

2026-03-14

How to Cite

Xie, X., Xue, Z., Wu, J., Li, J., Wang, Y., Hu, X., … Zhang, J. (2026). LLM-Oriented Token-Adaptive Knowledge Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34070–34078. https://doi.org/10.1609/aaai.v40i40.40701

Section

AAAI Technical Track on Natural Language Processing V