Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

Authors

  • Ruiming Chen Southeast University
  • Junming Yang Southeast University
  • Shiyu Xia Southeast University
  • Xu Yang Southeast University
  • Xin Geng Southeast University, China

DOI:

https://doi.org/10.1609/aaai.v40i24.39111

Abstract

CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the large number of parameters and the cost of large-scale pre-training make it challenging to pre-train CLIP models at different scales. Learngene extracts generalizable components, termed learngene, from an ancestry model and uses them to initialize diverse descendant models. Previous Learngene paradigms fail to handle generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness: it achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage and reduces pre-training costs by around 2.8× for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.
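
The abstract does not spell out how the weighted-sum extraction or the descendant initialization works, so the following is only an illustrative sketch under stated assumptions, not the MM-LG implementation. It assumes that each ancestry layer can be represented as a flat weight vector of the same size, that a compact block is formed as a softmax-weighted sum of those layers, and that a descendant of any depth is initialized by copying that block into each of its layers. The names `extract_block`, `init_descendant`, and `scores` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive coefficients that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_block(ancestry_layers, scores):
    """Hypothetical extraction step: combine same-shaped layer weight
    vectors into one compact block via a weighted sum."""
    coeffs = softmax(scores)
    size = len(ancestry_layers[0])
    return [
        sum(c * layer[i] for c, layer in zip(coeffs, ancestry_layers))
        for i in range(size)
    ]


def init_descendant(block, depth):
    """Hypothetical initialization step: every layer of the descendant
    starts from its own copy of the extracted block."""
    return [list(block) for _ in range(depth)]


# Toy ancestry model: 12 layers, 4 weights each (assumed sizes).
random.seed(0)
ancestry = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(12)]

# Equal scores give a plain average; in practice the scores would be learned.
scores = [0.0] * len(ancestry)
block = extract_block(ancestry, scores)

# One stored block initializes descendants of different depths.
small = init_descendant(block, depth=3)
large = init_descendant(block, depth=6)
```

In this sketch only the single block needs to be stored, rather than one pre-trained model per target scale, which is the intuition behind the storage and pre-training savings the abstract reports.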

Published

2026-03-14

How to Cite

Chen, R., Yang, J., Xia, S., Yang, X., & Geng, X. (2026). Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20235-20243. https://doi.org/10.1609/aaai.v40i24.39111

Section

AAAI Technical Track on Machine Learning I