VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping

Authors

  • Zheng Chen University of Science and Technology of China
  • Yu Zeng University of Science and Technology of China
  • Zehui Chen University of Science and Technology of China
  • Hongzhi Gao University of Science and Technology of China
  • Lin Chen University of Science and Technology of China
  • Jiaming Liu Peking University
  • Feng Zhao University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i3.32239

Abstract

Although pre-trained large vision foundation models (VFMs) yield superior results on various downstream tasks, full fine-tuning is often impractical due to its high computational cost and storage requirements. Recent advancements in parameter-efficient fine-tuning (PEFT) of VFMs for image classification show significant promise. However, the application of PEFT techniques to dense prediction tasks remains largely unexplored. Our analysis of existing methods reveals that the underlying premise of utilizing low-rank parameter matrices, despite their efficacy in specific applications, may not be adequately suited to dense prediction tasks. To this end, we propose a novel PEFT learning approach tailored for dense prediction tasks, namely VFM-Adapter. Specifically, the VFM-Adapter introduces a hybrid operation mapping technique that seamlessly integrates local information and global modeling into the adapter module, capitalizing on the distinct inductive biases inherent in different operations. Additionally, we dynamically generate parameters for the VFM-Adapter, enabling input-dependent flexibility in feature extraction. To validate the efficacy of VFM-Adapter, we conduct extensive experiments across object detection, semantic segmentation, and instance segmentation tasks. Results on multiple benchmarks consistently demonstrate the superiority of our method over previous approaches. Notably, with only three percent of the trainable parameters of the SAM-Base backbone, our approach achieves competitive or even superior performance compared to full fine-tuning. The code will be available.
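To make the abstract's two ideas concrete, the following is a minimal NumPy sketch of an adapter that (a) mixes a local branch (3x3 depthwise-style convolution) with a global branch (broadcast spatial mean), and (b) generates the mixing weight dynamically from the input. All names, shapes, and the scalar gate are illustrative assumptions, not the paper's actual architecture; frozen random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)


def dynamic_hybrid_adapter(x, r=4):
    """Illustrative hybrid adapter sketch (not the paper's exact design).

    x: feature map of shape (H, W, C); r: bottleneck reduction ratio.
    """
    H, W, C = x.shape
    # Random projections stand in for learned down/up weights.
    w_down = rng.standard_normal((C, C // r)) / np.sqrt(C)
    w_up = rng.standard_normal((C // r, C)) / np.sqrt(C // r)
    kernel = rng.standard_normal((3, 3)) / 3.0  # shared spatial kernel

    h = x @ w_down  # down-projection -> (H, W, C // r)

    # Local branch: 3x3 convolution with zero padding, applied per channel.
    pad = np.pad(h, ((1, 1), (1, 1), (0, 0)))
    local = np.zeros_like(h)
    for i in range(3):
        for j in range(3):
            local += kernel[i, j] * pad[i:i + H, j:j + W]

    # Global branch: broadcast the spatial mean as coarse global context.
    global_ctx = np.broadcast_to(h.mean(axis=(0, 1)), h.shape)

    # Dynamic parameter generation, reduced here to a single input-dependent
    # sigmoid gate blending the two operations (a stand-in hypernetwork).
    gate = 1.0 / (1.0 + np.exp(-h.mean()))
    mixed = gate * local + (1.0 - gate) * global_ctx

    return np.maximum(mixed, 0.0) @ w_up  # activation + up-projection


x = rng.standard_normal((8, 8, 16))
y = dynamic_hybrid_adapter(x)
```

In a real PEFT setting the adapter output would be added residually to the frozen backbone features, and only the adapter (plus its parameter generator) would be trained.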

Published

2025-04-11

How to Cite

Chen, Z., Zeng, Y., Chen, Z., Gao, H., Chen, L., Liu, J., & Zhao, F. (2025). VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2385–2393. https://doi.org/10.1609/aaai.v39i3.32239

Section

AAAI Technical Track on Computer Vision II