Distilling Cross-Modal Knowledge via Feature Disentanglement

Junhong Liu; Yuan Zhang; Tao Huang; Wenchao Xu; Renyu Yang

doi:10.1609/aaai.v40i28.39548

Authors

Junhong Liu, School of Software, Beihang University
Yuan Zhang, School of Computer Science, Peking University
Tao Huang, Shanghai Jiao Tong University
Wenchao Xu, Hong Kong University of Science and Technology
Renyu Yang, School of Software, Beihang University

Abstract

Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where representational inconsistencies across modalities make knowledge transfer difficult. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method that decouples and balances knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features exhibit high consistency across modalities, whereas high-frequency features show extremely low cross-modal similarity. Accordingly, we apply distinct losses to the two groups: strong alignment for low-frequency features and relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities and employ a shared classifier to unify feature spaces. Extensive experiments on multiple benchmark datasets show that our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches.
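The decoupling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the frequency transform, the cutoff, or the exact loss functions, so the naive DFT, the `cutoff` parameter, the mean-squared error used for strong alignment, the cosine distance used for relaxed alignment, and the `high_weight` factor are all assumptions chosen for illustration. The scale consistency loss and shared classifier are not covered here.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part (valid for conjugate-symmetric input)."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_frequencies(x, cutoff):
    """Split a feature vector into low- and high-frequency parts.

    Assumption: a bin counts as "low" when its frequency index, measured
    symmetrically as min(k, n - k), is at most `cutoff`.
    """
    n = len(x)
    spectrum = dft(x)
    is_low = [min(k, n - k) <= cutoff for k in range(n)]
    low = idft([spectrum[k] if is_low[k] else 0 for k in range(n)])
    high = idft([0 if is_low[k] else spectrum[k] for k in range(n)])
    return low, high


def mse(a, b):
    """Mean-squared error: penalizes any difference (strong alignment)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity: ignores magnitude (a relaxed alignment, by assumption)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def frequency_decoupled_loss(student, teacher, cutoff=1, high_weight=0.1):
    """Hypothetical combined loss: strict on low frequencies, relaxed on high."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return mse(s_low, t_low) + high_weight * cosine_distance(s_high, t_high)


# Toy feature vectors standing in for student and teacher embeddings.
student_feat = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0, 0.0]
teacher_feat = [2.0 * v for v in student_feat]
loss = frequency_decoupled_loss(student_feat, teacher_feat)
```

In this toy case the teacher features are a scaled copy of the student features, so the high-frequency term is close to zero while the low-frequency term is not, which shows how the two bands are treated differently under these assumed losses.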
