MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i39.40619

Abstract
Knowledge distillation (KD) is widely recognized as an effective approach for compressing large language models (LLMs). However, standard KD methods often falter when confronted with architectural or tokenization heterogeneity between teacher and student models, which creates a mismatch in their representations. While Optimal Transport (OT) provides a promising solution for aligning these representations, most OT-based methods rely on a single cost function, which is insufficient to capture the multifaceted discrepancies between models with distinct designs. To address this limitation, we introduce Multi-Cost Wasserstein Knowledge Distillation (MCW-KD), a novel framework that enhances KD by simultaneously optimizing several cost functions within a unified OT formulation. MCW-KD employs specific cost matrices to effectively align both the final hidden states and the output distributions of the models. We also provide a rigorous theoretical foundation for the proposed Multi-Cost Wasserstein Distance, ensuring both mathematical validity and computational tractability. Extensive experiments on instruction-following datasets demonstrate that MCW-KD significantly improves student model performance compared to state-of-the-art KD baselines, especially when teacher and student models have different tokenizers.
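To make the core idea concrete, the following is a minimal illustrative sketch, not the paper's actual formulation: it combines two hypothetical cost matrices (Euclidean and cosine distances between teacher and student hidden states) into one weighted cost and solves the resulting OT problem with entropic-regularized Sinkhorn iterations. All variable names, weights, and the choice of costs are assumptions for illustration.

```python
import numpy as np

def sinkhorn_cost(C, eps=0.1, n_iter=200):
    # Entropic-regularized OT between uniform marginals via Sinkhorn iterations.
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((P * C).sum())       # transport cost under that plan

rng = np.random.default_rng(0)
# Hypothetical teacher/student hidden states (different sequence lengths,
# as with mismatched tokenizers); shapes and data are made up.
H_t = rng.normal(size=(6, 16))
H_s = rng.normal(size=(5, 16))

# Cost 1: pairwise Euclidean distances between hidden states.
C_euc = np.linalg.norm(H_t[:, None, :] - H_s[None, :, :], axis=-1)

# Cost 2: pairwise cosine distances, a second view of the same discrepancy.
Ht_n = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
Hs_n = H_s / np.linalg.norm(H_s, axis=1, keepdims=True)
C_cos = 1.0 - Ht_n @ Hs_n.T

# "Multi-cost": a weighted combination of several costs inside one OT problem.
C_multi = 0.5 * C_euc + 0.5 * C_cos
loss = sinkhorn_cost(C_multi)
print(loss)
```

In an actual distillation loop, such a loss would be differentiated with respect to the student's hidden states and added to the training objective; the 0.5/0.5 weighting here is arbitrary.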
Published
2026-03-14
How to Cite
Vuong, H. T., Le, T., Tran, Q., Van, L. N., & Le, T. (2026). MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33332–33340. https://doi.org/10.1609/aaai.v40i39.40619
Section
AAAI Technical Track on Natural Language Processing IV