MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i39.40619

Abstract
Knowledge distillation (KD) is widely recognized as an effective approach for compressing large language models (LLMs). However, standard KD methods often falter when confronted with architectural or tokenization heterogeneity between teacher and student models, which creates a mismatch in their representations. While Optimal Transport (OT) provides a promising solution for aligning these representations, most OT-based methods rely on a single cost function, which is insufficient to capture the multifaceted discrepancies between models with distinct designs. To address this limitation, we introduce Multi-Cost Wasserstein Knowledge Distillation (MCW-KD), a novel framework that enhances KD by simultaneously optimizing several cost functions within a unified OT formulation. MCW-KD employs specific cost matrices to effectively align both the final hidden states and the output distributions of the models. We also provide a rigorous theoretical foundation for the proposed Multi-Cost Wasserstein Distance, ensuring both mathematical validity and computational tractability. Extensive experiments on instruction-following datasets demonstrate that MCW-KD significantly improves student model performance compared to state-of-the-art KD baselines, especially when teacher and student models have different tokenizers.
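To make the core idea concrete, the following is a minimal illustrative sketch, not the paper's actual formulation: it combines two hypothetical cost matrices (Euclidean and cosine distances between teacher and student hidden states) into one weighted cost and solves the resulting OT problem with entropic-regularized Sinkhorn iterations. All variable names, weights, and the choice of costs are assumptions for illustration.

```python
import numpy as np

def sinkhorn_cost(C, eps=0.1, n_iter=200):
    # Entropic-regularized OT between uniform marginals via Sinkhorn iterations.
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((P * C).sum())       # transport cost under that plan

rng = np.random.default_rng(0)
# Hypothetical teacher/student hidden states (different sequence lengths,
# as with mismatched tokenizers); shapes and data are made up.
H_t = rng.normal(size=(6, 16))
H_s = rng.normal(size=(5, 16))

# Cost 1: pairwise Euclidean distances between hidden states.
C_euc = np.linalg.norm(H_t[:, None, :] - H_s[None, :, :], axis=-1)

# Cost 2: pairwise cosine distances, a second view of the same discrepancy.
Ht_n = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
Hs_n = H_s / np.linalg.norm(H_s, axis=1, keepdims=True)
C_cos = 1.0 - Ht_n @ Hs_n.T

# "Multi-cost": a weighted combination of several costs inside one OT problem.
C_multi = 0.5 * C_euc + 0.5 * C_cos
loss = sinkhorn_cost(C_multi)
print(loss)
```

In an actual distillation loop, such a loss would be differentiated with respect to the student's hidden states and added to the training objective; the 0.5/0.5 weighting here is arbitrary.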
Published
2026-03-14
How to Cite
Vuong, H. T., Le, T., Tran, Q., Van, L. N., & Le, T. (2026). MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33332–33340. https://doi.org/10.1609/aaai.v40i39.40619
Section
AAAI Technical Track on Natural Language Processing IV