Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders

Authors

  • Bumsoo Kim LG AI Research
  • Jinhyung Kim LG AI Research
  • Yeonsik Jo LG AI Research
  • Seung Hwan Kim LG AI Research

DOI:

https://doi.org/10.1609/aaai.v38i3.28052

Keywords:

CV: Language and Vision, CV: Multi-modal Vision

Abstract

Recent advances in vision-language pretraining (VLP) have been largely attributed to large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, causing data inefficiency. To address this issue, knowledge distillation has been explored, at the expense of extra image and text momentum encoders that generate teaching signals for misaligned image-text pairs. In this paper, our goal is to resolve the misalignment problem with an efficient distillation framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with Self-distilled Encoders. ECLIPSE features a distinctive distillation architecture in which a single text encoder is shared between an online image encoder and a momentum image encoder. This design choice enables distillation to operate within a unified projected space of text embeddings, resulting in better performance. Building on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through extensive experiments, we validate that there is a sweet spot between expedition and distillation, where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed.
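
The architecture described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the momentum value, the temperature, and the cross-entropy form of the distillation loss are all assumptions chosen for illustration, and the expedition of the online image encoder is not modeled. The sketch only shows the two ingredients the abstract names: a momentum (exponential moving average) image encoder acting as teacher, and a distillation signal computed against embeddings from one shared text encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ema_update(teacher_params, online_params, momentum=0.99):
    """Assumed momentum rule: teacher <- m * teacher + (1 - m) * online."""
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * online_params[name]
        for name in teacher_params
    }


def distillation_loss(online_img, teacher_img, text, temperature=0.07):
    """Cross-entropy between teacher and online image-to-text distributions.

    Both image encoders are compared against the SAME text embeddings, so the
    two distributions live in one shared text embedding space (hypothetical
    loss form; the abstract does not specify it).
    """
    online_img = l2_normalize(online_img)
    teacher_img = l2_normalize(teacher_img)
    text = l2_normalize(text)

    online_logits = online_img @ text.T / temperature    # shape (batch, batch)
    teacher_logits = teacher_img @ text.T / temperature  # shape (batch, batch)

    teacher_probs = softmax(teacher_logits)              # soft targets, no gradient
    online_log_probs = np.log(softmax(online_logits) + 1e-12)
    return float(-(teacher_probs * online_log_probs).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(4, 8))      # output of the shared text encoder
    teacher_emb = rng.normal(size=(4, 8))   # output of the momentum image encoder
    online_emb = rng.normal(size=(4, 8))    # output of the online image encoder
    print("distillation loss:", distillation_loss(online_emb, teacher_emb, text_emb))
```

Under these assumptions, the loss is smallest when the online encoder reproduces the teacher's image-to-text similarity distribution, and the teacher's weights track the online encoder slowly through `ema_update`.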

Published

2024-03-24

How to Cite

Kim, B., Kim, J., Jo, Y., & Kim, S. H. (2024). Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2732-2740. https://doi.org/10.1609/aaai.v38i3.28052

Section

AAAI Technical Track on Computer Vision II