Bridging the Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation

Huazheng Wang; Yongcheng Jing; Haifeng Sun; Jingyu Wang; Jianxin Liao; Leszek Rutkowski; Dacheng Tao

doi:10.1609/aaai.v40i39.40637

Authors

Huazheng Wang Beijing University of Posts and Telecommunications, Beijing 100876, China; Nanyang Technological University
Yongcheng Jing Nanyang Technological University, Singapore 639798
Haifeng Sun Beijing University of Posts and Telecommunications, Beijing 100876, China
Jingyu Wang Beijing University of Posts and Telecommunications, Beijing 100876, China
Jianxin Liao Beijing University of Posts and Telecommunications, Beijing 100876, China
Leszek Rutkowski AGH University of Krakow, 30-059 Kraków, and the SAN University, 90-113, Łódź, Poland
Dacheng Tao Nanyang Technological University, Singapore 639798

DOI:

https://doi.org/10.1609/aaai.v40i39.40637

Abstract

Cross-tokenizer knowledge distillation, where the teacher and student employ different tokenizers, is becoming increasingly prevalent, yet it poses underexplored challenges. Existing methods fail to capture the rich knowledge encoded in teacher logits: they neglect semantic information, align logits inaccurately and with bias, and discard distributional structure, which ultimately leads to poor distillation. To address these issues, we propose SeDi, a semantics and distribution-aware knowledge transfer framework tailored for cross-tokenizer distillation. To preserve factual knowledge, SeDi employs bipartite graph-based alignment at the tokenization level and a sliding window re-encoding strategy at the vocabulary level, enabling unbiased transfer of the teacher’s next-token predictions into the student’s vocabulary space. To further retain distributional information, we align the student’s entropy with that of the teacher by incorporating the student’s own logits during training, which helps to mitigate the exposure bias problem. Experiments on ten datasets across three task domains and five different teacher-student model pairs with varying vocabulary sizes demonstrate that SeDi delivers substantial improvements, with gains of up to 19.8%.
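
The abstract does not describe how the bipartite alignment is constructed. The sketch below illustrates only the underlying tokenizer gap and is not SeDi's algorithm. It assumes both tokenizations concatenate back to the same text and links a teacher token to a student token whenever their character spans overlap. The function names `char_spans` and `overlap_edges` and the example tokens are hypothetical.

```python
def char_spans(tokens):
    """Return (start, end) character offsets for each token.

    Assumption: the tokens concatenate exactly to the original text.
    """
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def overlap_edges(teacher_tokens, student_tokens):
    """Build the edge list of a bipartite graph between two tokenizations.

    An edge (i, j) means teacher token i and student token j share at
    least one character of the underlying text.
    """
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("tokenizations must cover the same text")
    t_spans = char_spans(teacher_tokens)
    s_spans = char_spans(student_tokens)
    edges = []
    for i, (ts, te) in enumerate(t_spans):
        for j, (ss, se) in enumerate(s_spans):
            # Two half-open spans overlap when the later start
            # comes before the earlier end.
            if max(ts, ss) < min(te, se):
                edges.append((i, j))
    return edges


# Hypothetical tokenizations of the same string by two tokenizers.
teacher = ["token", "izer", " gap"]
student = ["tok", "en", "izer", " ", "gap"]
edges = overlap_edges(teacher, student)
print(edges)
```

In this toy case only the teacher token "izer" has a one-to-one counterpart on the student side. The other two each map to two student tokens, which is the kind of mismatch a cross-tokenizer alignment step has to resolve before teacher predictions can be compared with student predictions.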
