FATE: Feature-Adapted Parameter Tuning for Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v39i9.32975

Abstract
Following the recent popularity of vision-language models, several approaches, e.g., parameter-efficient fine-tuning (PEFT), have been proposed to extend them to different downstream tasks. Previous PEFT works motivate their methods by introducing new parameters for adaptation but still learn these weights from scratch, i.e., from random initialization. In this paper, we present a novel strategy that exploits the potential of prompts, e.g., vision features, to help the initial parameter space adapt to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed FATE, which injects informative features from the vision encoder into the language encoder's parameter space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE exhibits superior generalization performance on 11 datasets with a very small amount of extra parameters and computation.
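The core idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names, shapes, and the choice of a simple linear projection into an additive bias are assumptions made for clarity.

```python
import numpy as np

# Hypothetical sketch of FATE's core idea: a vision feature from the last
# layer of CLIP's (frozen) vision encoder is linearly projected and treated
# as an extra parameter offset for each layer of the language encoder, so
# training only the projections adapts the text branch to the image.

rng = np.random.default_rng(0)

d_vision, d_text, num_text_layers = 768, 512, 12  # assumed dimensions

# Placeholder for a frozen last-layer vision feature (e.g., the [CLS] token).
vision_feature = rng.standard_normal(d_vision)

# One small projection per language layer maps the vision feature into that
# layer's parameter space; these projections are the only trainable weights.
projections = [rng.standard_normal((d_text, d_vision)) * 0.01
               for _ in range(num_text_layers)]

def feature_adapted_params(layer_idx):
    """Vision-conditioned parameter offset for one language-encoder layer."""
    return projections[layer_idx] @ vision_feature  # shape: (d_text,)

def adapted_sublayer(x, frozen_weight, layer_idx):
    """A frozen linear sublayer plus the feature-adapted offset."""
    return x @ frozen_weight + feature_adapted_params(layer_idx)

# Example: pass one token embedding through an adapted sublayer.
frozen_w = rng.standard_normal((d_text, d_text)) * 0.01  # frozen CLIP weight
token = rng.standard_normal(d_text)
out = adapted_sublayer(token, frozen_w, layer_idx=0)
print(out.shape)  # (512,)
```

Because the frozen CLIP weights are never updated, the extra parameter count is only `num_text_layers * d_text * d_vision` for the projections, which is consistent with the abstract's claim of a very small parameter and compute overhead.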
Published
2025-04-11
How to Cite
Xu, Z., Peng, Z., Yang, X., & Shen, W. (2025). FATE: Feature-Adapted Parameter Tuning for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9014-9022. https://doi.org/10.1609/aaai.v39i9.32975
Section
AAAI Technical Track on Computer Vision VIII