FATE: Feature-Adapted Parameter Tuning for Vision-Language Models

Authors

  • Zhengqin Xu, Shanghai Jiao Tong University, China
  • Zelin Peng, Shanghai Jiao Tong University, China
  • Xiaokang Yang, Shanghai Jiao Tong University, China
  • Wei Shen, Shanghai Jiao Tong University, China

DOI:

https://doi.org/10.1609/aaai.v39i9.32975

Abstract

Following the recent popularity of vision-language models, several attempts, e.g., parameter-efficient fine-tuning (PEFT), have been made to extend them to different downstream tasks. Previous PEFT works motivate their methods by introducing new parameters for adaptation, but these weights must still be learned from scratch, i.e., from random initialization. In this paper, we present a novel strategy that exploits the potential of prompts, e.g., vision features, to help the initial parameter space adapt to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed FATE, which injects informative features from the vision encoder into the language encoder's parameter space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE achieves superior generalization performance on 11 datasets while adding very few extra parameters and little extra computation.
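The core idea described above, projecting vision features and treating them as tunable parameters inside each language-encoder layer, can be illustrated with a minimal sketch. This is a conceptual illustration only, not the authors' implementation: the module name `FeatureAdaptedLayer`, the choice of an additive modulation, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureAdaptedLayer(nn.Module):
    """Hypothetical sketch: project vision features into a language layer's
    hidden space and use them as feature-adapted parameters (here, an
    additive modulation). The CLIP encoders themselves stay frozen; only
    the small projection is trained."""

    def __init__(self, text_dim: int, vision_dim: int):
        super().__init__()
        # The only trainable parameters: a projection from vision-feature
        # space into the language encoder's hidden space.
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, text_hidden: torch.Tensor, vision_feat: torch.Tensor):
        # text_hidden: (batch, seq_len, text_dim) from a language-encoder layer
        # vision_feat: (batch, vision_dim) from the last vision-encoder layer
        delta = self.proj(vision_feat).unsqueeze(1)  # (batch, 1, text_dim)
        # Injecting the projected vision features lets the two branches
        # communicate at this layer.
        return text_hidden + delta

# Toy usage with made-up dimensions.
batch, seq_len, text_dim, vision_dim = 2, 7, 512, 768
layer = FeatureAdaptedLayer(text_dim, vision_dim)
out = layer(torch.randn(batch, seq_len, text_dim),
            torch.randn(batch, vision_dim))
print(out.shape)  # torch.Size([2, 7, 512])
```

Because only the per-layer projections are optimized, the number of trainable parameters stays small relative to full fine-tuning, which matches the parameter-efficiency claim in the abstract.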

Published

2025-04-11

How to Cite

Xu, Z., Peng, Z., Yang, X., & Shen, W. (2025). FATE: Feature-Adapted Parameter Tuning for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9014-9022. https://doi.org/10.1609/aaai.v39i9.32975

Section

AAAI Technical Track on Computer Vision VIII