VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping

Authors

  • Zheng Chen University of Science and Technology of China
  • Yu Zeng University of Science and Technology of China
  • Zehui Chen University of Science and Technology of China
  • Hongzhi Gao University of Science and Technology of China
  • Lin Chen University of Science and Technology of China
  • Jiaming Liu Peking University
  • Feng Zhao University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i3.32239

Abstract

Although pre-trained large vision foundation models (VFMs) yield superior results on various downstream tasks, full fine-tuning is often impractical due to its high computational cost and storage requirements. Recent advancements in parameter-efficient fine-tuning (PEFT) of VFMs for image classification show significant promise. However, the application of PEFT techniques to dense prediction tasks remains largely unexplored. Our analysis of existing methods reveals that the underlying premise of utilizing low-rank parameter matrices, despite their efficacy in specific applications, may not be adequately suited to dense prediction tasks. To this end, we propose a novel PEFT learning approach tailored for dense prediction tasks, namely VFM-Adapter. Specifically, the VFM-Adapter introduces a hybrid operation mapping technique that seamlessly integrates local information and global modeling into the adapter module, capitalizing on the distinct inductive biases inherent in different operations. Additionally, we dynamically generate parameters for the VFM-Adapter, enabling input-dependent flexibility in feature extraction. To validate the efficacy of VFM-Adapter, we conduct extensive experiments across object detection, semantic segmentation, and instance segmentation tasks. Results on multiple benchmarks consistently demonstrate the superiority of our method over previous approaches. Notably, with only three percent of the trainable parameters of the SAM-Base backbone, our approach achieves competitive or even superior performance compared to full fine-tuning. The code will be available.
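To make the abstract's two ideas concrete, the following is a minimal NumPy sketch of an adapter that (a) mixes a local branch (3x3 depthwise-style convolution) with a global branch (broadcast spatial mean), and (b) generates the mixing weight dynamically from the input. All names, shapes, and the scalar gate are illustrative assumptions, not the paper's actual architecture; frozen random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)


def dynamic_hybrid_adapter(x, r=4):
    """Illustrative hybrid adapter sketch (not the paper's exact design).

    x: feature map of shape (H, W, C); r: bottleneck reduction ratio.
    """
    H, W, C = x.shape
    # Random projections stand in for learned down/up weights.
    w_down = rng.standard_normal((C, C // r)) / np.sqrt(C)
    w_up = rng.standard_normal((C // r, C)) / np.sqrt(C // r)
    kernel = rng.standard_normal((3, 3)) / 3.0  # shared spatial kernel

    h = x @ w_down  # down-projection -> (H, W, C // r)

    # Local branch: 3x3 convolution with zero padding, applied per channel.
    pad = np.pad(h, ((1, 1), (1, 1), (0, 0)))
    local = np.zeros_like(h)
    for i in range(3):
        for j in range(3):
            local += kernel[i, j] * pad[i:i + H, j:j + W]

    # Global branch: broadcast the spatial mean as coarse global context.
    global_ctx = np.broadcast_to(h.mean(axis=(0, 1)), h.shape)

    # Dynamic parameter generation, reduced here to a single input-dependent
    # sigmoid gate blending the two operations (a stand-in hypernetwork).
    gate = 1.0 / (1.0 + np.exp(-h.mean()))
    mixed = gate * local + (1.0 - gate) * global_ctx

    return np.maximum(mixed, 0.0) @ w_up  # activation + up-projection


x = rng.standard_normal((8, 8, 16))
y = dynamic_hybrid_adapter(x)
```

In a real PEFT setting the adapter output would be added residually to the frozen backbone features, and only the adapter (plus its parameter generator) would be trained.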

Published

2025-04-11

How to Cite

Chen, Z., Zeng, Y., Chen, Z., Gao, H., Chen, L., Liu, J., & Zhao, F. (2025). VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2385–2393. https://doi.org/10.1609/aaai.v39i3.32239

Section

AAAI Technical Track on Computer Vision II