Bridging the Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation

Authors

  • Huazheng Wang, Beijing University of Posts and Telecommunications, Beijing 100876, China; Nanyang Technological University
  • Yongcheng Jing, Nanyang Technological University, Singapore 639798
  • Haifeng Sun, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Jingyu Wang, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Jianxin Liao, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Leszek Rutkowski, AGH University of Krakow, 30-059 Kraków, and the SAN University, 90-113 Łódź, Poland
  • Dacheng Tao, Nanyang Technological University, Singapore 639798

DOI:

https://doi.org/10.1609/aaai.v40i39.40637

Abstract

Cross-tokenizer knowledge distillation, where the teacher and student employ different tokenizers, is becoming increasingly prevalent, yet it poses underexplored challenges. Existing methods fail to capture the rich knowledge encoded in teacher logits: they neglect semantic information, align logits inaccurately and with bias, and discard distributional structure, all of which degrade distillation quality. To address these issues, we propose SeDi, a semantics and distribution-aware knowledge transfer framework tailored for cross-tokenizer distillation. To preserve factual knowledge, SeDi employs bipartite graph-based alignment at the tokenization level and a sliding window re-encoding strategy at the vocabulary level, enabling unbiased transfer of the teacher’s next-token predictions into the student’s vocabulary space. To further retain distributional information, we align the student’s entropy with that of the teacher by incorporating the student’s own logits during training, which helps mitigate exposure bias. Experiments on ten datasets across three task domains and five teacher-student model pairs with varying vocabulary sizes demonstrate that SeDi delivers substantial improvements, with gains of up to 19.8%.
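
As a rough illustration of what tokenization-level alignment involves, the sketch below links two tokenizations of the same text by character overlap, producing the edges of a bipartite graph between teacher and student token positions. It is not SeDi's algorithm, which the abstract does not specify. The function names `char_spans` and `overlap_edges` are hypothetical, and the sketch assumes tokens are plain substrings whose concatenation reproduces the text.

```python
from typing import List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the [start, end) character offsets of each token.

    Assumption: concatenating the tokens reproduces the original text.
    """
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def overlap_edges(teacher_tokens: List[str],
                  student_tokens: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical helper: edges (i, j) linking teacher token i and
    student token j whenever their character spans overlap."""
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must cover the same text")
    t_spans = char_spans(teacher_tokens)
    s_spans = char_spans(student_tokens)
    edges = []
    i = j = 0
    # Two-pointer sweep over both span lists, left to right.
    while i < len(t_spans) and j < len(s_spans):
        (ts, te), (ss, se) = t_spans[i], s_spans[j]
        if max(ts, ss) < min(te, se):
            edges.append((i, j))
        # Advance whichever token ends first; advance both on a shared boundary.
        if te < se:
            i += 1
        elif se < te:
            j += 1
        else:
            i += 1
            j += 1
    return edges


teacher = ["un", "believ", "able"]
student = ["unbe", "liev", "able"]
edges = overlap_edges(teacher, student)
```

In this toy case, teacher position 1 (`believ`) overlaps two student tokens, so a one-to-one position match would misalign it. A graph of overlaps makes such many-to-many correspondences explicit.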

Published

2026-03-14

How to Cite

Wang, H., Jing, Y., Sun, H., Wang, J., Liao, J., Rutkowski, L., & Tao, D. (2026). Bridging the Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33494–33502. https://doi.org/10.1609/aaai.v40i39.40637

Section

AAAI Technical Track on Natural Language Processing IV