CALIP: Zero-Shot Enhancement of CLIP with Parameter-Free Attention

Ziyu Guo; Renrui Zhang; Longtian Qiu; Xianzheng Ma; Xupeng Miao; Xuming He; Bin Cui

doi:10.1609/aaai.v37i1.25152

Authors

Ziyu Guo (School of CS and Key Lab of HCST, Peking University; The Chinese University of Hong Kong)
Renrui Zhang (The Chinese University of Hong Kong; Shanghai AI Laboratory)
Longtian Qiu (ShanghaiTech University)
Xianzheng Ma (Shanghai AI Laboratory)
Xupeng Miao (Carnegie Mellon University)
Xuming He (ShanghaiTech University)
Bin Cui (School of CS and Key Lab of HCST, Peking University)

DOI:

https://doi.org/10.1609/aaai.v37i1.25152

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, ML: Transfer, Domain Adaptation, Multi-Task Learning

Abstract

Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with promising zero-shot performance. To further improve its downstream accuracy, existing works add learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirements severely hinder efficient model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. As the pre-training has largely reduced the embedding distances between the two modalities, we discard all learnable parameters in the attention and bidirectionally update the multi-modal features, making the whole process parameter-free and training-free. In this way, the image features are blended with text-aware signals and the text representations become visually guided, for better adaptive zero-shot alignment. We evaluate CALIP on various benchmarks of 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. Building on this, we further insert a small number of linear layers into CALIP's attention module and verify its robustness under few-shot settings, where it also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP. Code is available at https://github.com/ZiyuGuo99/CALIP.
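The bidirectional, parameter-free update described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function names, the temperatures `temp_v` and `temp_t`, the blending weights `w_v` and `w_t`, and the mean-pooling step are all assumptions made for illustration, and the random arrays stand in for real CLIP features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def parameter_free_attention(visual, text, temp_v=1.0, temp_t=1.0):
    """Cross-modal attention with no learnable projections (assumed form).

    visual: (N, C) spatial image features, text: (K, C) class text features.
    The similarity matrix itself serves as the attention map in both directions.
    """
    visual = l2_normalize(visual)
    text = l2_normalize(text)
    attn = visual @ text.T                                     # (N, K)
    visual_upd = softmax(attn / temp_v, axis=-1) @ text        # text-aware visual, (N, C)
    text_upd = softmax(attn.T / temp_t, axis=-1) @ visual      # visual-guided text, (K, C)
    return visual_upd, text_upd

def zero_shot_logits(visual, text, w_v=1.0, w_t=1.0):
    # Hypothetical blending: original logits plus logits from each updated modality.
    visual_upd, text_upd = parameter_free_attention(visual, text)
    pooled = l2_normalize(l2_normalize(visual).mean(axis=0))   # (C,)
    pooled_upd = l2_normalize(visual_upd.mean(axis=0))         # (C,)
    text_n = l2_normalize(text)
    text_upd_n = l2_normalize(text_upd)
    return (pooled @ text_n.T
            + w_v * (pooled_upd @ text_n.T)
            + w_t * (pooled @ text_upd_n.T))                   # (K,)

# Stand-in features: 49 spatial tokens, 5 class prompts, 16 channels.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(49, 16))
text_feats = rng.normal(size=(5, 16))
v_upd, t_upd = parameter_free_attention(visual_feats, text_feats)
logits = zero_shot_logits(visual_feats, text_feats)
predicted_class = int(np.argmax(logits))
```

Because no weights are introduced, the update needs no training data; the only free choices in this sketch are the scalar temperatures and blending weights, which would have to be set by hand or by search.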
