LLM-Oriented Token-Adaptive Knowledge Distillation
DOI:
https://doi.org/10.1609/aaai.v40i40.40701
Abstract
Knowledge Distillation (KD) is a key technique for compressing Large-scale Language Models (LLMs), but prevailing logit-based methods employ static strategies misaligned with the student's dynamic learning process. By treating all tokens indiscriminately with a fixed temperature, these methods yield suboptimal knowledge transfer. To address this, we propose LLM-oriented token-Adaptive Knowledge Distillation (AdaKD), a framework that adapts the distillation process to each token's real-time learning state. AdaKD consists of two synergistic modules driven by a unified token-difficulty metric. First, the Loss-driven Adaptive Token Focusing (LATF) module dynamically concentrates distillation on valuable tokens by monitoring the student's learning stability. Second, Inverse Difficulty Temperature Scaling (IDTS) introduces a counterintuitive token-level temperature schedule: low temperatures for difficult tokens to target error correction, and high temperatures for easy tokens so the student learns the teacher's smooth output distribution for better generalization. As a plug-and-play framework, AdaKD consistently improves performance across diverse distillation methods, model architectures, and benchmarks.
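The IDTS idea in the abstract can be sketched as a per-token temperature assigned inversely to difficulty. The sketch below is an illustration only, not the paper's implementation: it uses the student's per-token cross-entropy as a stand-in for the paper's unified token-difficulty metric, normalizes it within the batch, and maps high difficulty to a low temperature (and vice versa) before computing a temperature-scaled KL distillation loss. The function name and the `t_low`/`t_high` parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def idts_style_loss(student_logits, teacher_logits, targets,
                    t_low=1.0, t_high=4.0):
    """Illustrative inverse-difficulty temperature scaling (not the paper's code).

    student_logits, teacher_logits: (batch, seq, vocab)
    targets: (batch, seq) gold token ids
    Difficult tokens get the LOW temperature t_low (sharper teacher,
    targeting error correction); easy tokens get the HIGH temperature
    t_high (smoother teacher distribution, aiding generalization).
    """
    # Proxy difficulty: student cross-entropy on the gold token,
    # detached so the temperature does not receive gradients.
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets).detach()                          # (batch, seq)

    # Normalize difficulty to [0, 1] within the batch.
    d = (ce - ce.min()) / (ce.max() - ce.min() + 1e-8)

    # Inverse mapping: difficulty 1 -> t_low, difficulty 0 -> t_high.
    temp = (t_high - d * (t_high - t_low)).unsqueeze(-1)  # broadcast over vocab

    # Temperature-scaled KL(teacher || student), averaged over tokens.
    log_p_s = F.log_softmax(student_logits / temp, dim=-1)
    p_t = F.softmax(teacher_logits / temp, dim=-1)
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(-1)
    return kl.mean()
```

A LATF-style variant would additionally mask this per-token loss to the currently valuable tokens; that selection logic depends on the learning-stability signal the paper defines and is omitted here.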
Published
2026-03-14
How to Cite
Xie, X., Xue, Z., Wu, J., Li, J., Wang, Y., Hu, X., … Zhang, J. (2026). LLM-Oriented Token-Adaptive Knowledge Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34070–34078. https://doi.org/10.1609/aaai.v40i40.40701
Issue
Section
AAAI Technical Track on Natural Language Processing V