MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models

Authors

  • Hoang Tran Vuong, Hanoi University of Science and Technology
  • Tue Le, Hanoi University of Science and Technology
  • Quyen Tran, Rutgers University, New Jersey, USA
  • Linh Ngo Van, Hanoi University of Science and Technology
  • Trung Le, Monash University

DOI:

https://doi.org/10.1609/aaai.v40i39.40619

Abstract

Knowledge distillation (KD) is widely recognized as an effective approach for compressing large language models (LLMs). However, standard KD methods often falter when the teacher and student models differ in architecture or tokenization, which creates a mismatch between their representations. While Optimal Transport (OT) offers a promising way to align these representations, most OT-based methods rely on a single cost function, which is insufficient to capture the multifaceted discrepancies between models with distinct designs. To address this limitation, we introduce Multi-Cost Wasserstein Knowledge Distillation (MCW-KD), a novel framework that enhances KD by simultaneously optimizing several cost functions within a unified OT formulation. MCW-KD employs dedicated cost matrices to align both the final hidden states and the output distributions of the two models. We also provide a rigorous theoretical foundation for the proposed Multi-Cost Wasserstein Distance, ensuring both mathematical validity and computational tractability. Extensive experiments on instruction-following datasets demonstrate that MCW-KD significantly improves student model performance over state-of-the-art KD baselines, especially when teacher and student models use different tokenizers.
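To make the core idea concrete, the following is a minimal sketch of a multi-cost OT alignment in the spirit the abstract describes: several cost matrices are combined into a single transport problem, which is then solved with entropic-regularized (Sinkhorn) OT. This is an illustrative toy, not the paper's implementation; the choice of two costs (squared Euclidean on hidden states and a cosine dissimilarity), the equal weights, and all function names here are assumptions.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iters=200):
    """Entropic-regularized OT via Sinkhorn iterations (toy version)."""
    C = C / C.max()                    # normalize cost to avoid underflow in exp
    K = np.exp(-C / reg)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)              # scale to match column marginal b
        u = a / (K @ v)                # scale to match row marginal a
    P = u[:, None] * K * v[None, :]    # transport plan
    return P, float(np.sum(P * C))    # plan and transport cost

def multi_cost_wasserstein(a, b, costs, weights, reg=0.1):
    """Combine several cost matrices into one OT problem (illustrative)."""
    C = sum(w * Ck for w, Ck in zip(weights, costs))
    return sinkhorn(a, b, C, reg)

# Toy example: align 5 teacher tokens with 4 student tokens (8-dim states).
rng = np.random.default_rng(0)
T = rng.normal(size=(5, 8))            # hypothetical teacher hidden states
S = rng.normal(size=(4, 8))            # hypothetical student hidden states

# Cost 1: squared Euclidean distance between hidden states.
C_hidden = ((T[:, None, :] - S[None, :, :]) ** 2).sum(-1)
# Cost 2: cosine dissimilarity (an assumed second cost for illustration).
Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
C_cos = 1.0 - Tn @ Sn.T

a = np.full(5, 1 / 5)                  # uniform mass over teacher tokens
b = np.full(4, 1 / 4)                  # uniform mass over student tokens
P, dist = multi_cost_wasserstein(a, b, [C_hidden, C_cos], [0.5, 0.5])
```

In a distillation loss, `dist` (or a differentiable version of it) would be minimized with respect to the student's parameters; the weighted sum of costs is one simple way to merge multiple cost functions, and the paper's actual formulation may differ.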

Published

2026-03-14

How to Cite

Vuong, H. T., Le, T., Tran, Q., Van, L. N., & Le, T. (2026). MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33332–33340. https://doi.org/10.1609/aaai.v40i39.40619

Section

AAAI Technical Track on Natural Language Processing IV