FATE: Feature-Adapted Parameter Tuning for Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v39i9.32975

Abstract
Following the recent popularity of vision-language models, several approaches, e.g., parameter-efficient fine-tuning (PEFT), have been proposed to extend them to different downstream tasks. Previous PEFT works motivate their methods by introducing new parameters for adaptation but still learn these weights from scratch, i.e., from random initialization. In this paper, we present a novel strategy that exploits the potential of prompts, e.g., vision features, to help the initial parameter space adapt to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed FATE, which injects informative features from the vision encoder into the language encoder's parameter space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE exhibits superior generalization performance on 11 datasets with a very small amount of extra parameters and computation.
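The core idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names, shapes, and the choice of a simple linear projection into an additive bias are assumptions made for clarity.

```python
import numpy as np

# Hypothetical sketch of FATE's core idea: a vision feature from the last
# layer of CLIP's (frozen) vision encoder is linearly projected and treated
# as an extra parameter offset for each layer of the language encoder, so
# training only the projections adapts the text branch to the image.

rng = np.random.default_rng(0)

d_vision, d_text, num_text_layers = 768, 512, 12  # assumed dimensions

# Placeholder for a frozen last-layer vision feature (e.g., the [CLS] token).
vision_feature = rng.standard_normal(d_vision)

# One small projection per language layer maps the vision feature into that
# layer's parameter space; these projections are the only trainable weights.
projections = [rng.standard_normal((d_text, d_vision)) * 0.01
               for _ in range(num_text_layers)]

def feature_adapted_params(layer_idx):
    """Vision-conditioned parameter offset for one language-encoder layer."""
    return projections[layer_idx] @ vision_feature  # shape: (d_text,)

def adapted_sublayer(x, frozen_weight, layer_idx):
    """A frozen linear sublayer plus the feature-adapted offset."""
    return x @ frozen_weight + feature_adapted_params(layer_idx)

# Example: pass one token embedding through an adapted sublayer.
frozen_w = rng.standard_normal((d_text, d_text)) * 0.01  # frozen CLIP weight
token = rng.standard_normal(d_text)
out = adapted_sublayer(token, frozen_w, layer_idx=0)
print(out.shape)  # (512,)
```

Because the frozen CLIP weights are never updated, the extra parameter count is only `num_text_layers * d_text * d_vision` for the projections, which is consistent with the abstract's claim of a very small parameter and compute overhead.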
Published
2025-04-11
How to Cite
Xu, Z., Peng, Z., Yang, X., & Shen, W. (2025). FATE: Feature-Adapted Parameter Tuning for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9014-9022. https://doi.org/10.1609/aaai.v39i9.32975
Section
AAAI Technical Track on Computer Vision VIII