CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation

Zelin Peng; Zhengqin Xu; Feilong Tang; Wei Shen

doi:10.1609/aaai.v40i10.37789

Authors

Zelin Peng, Shanghai Jiao Tong University
Zhengqin Xu, Shanghai Jiao Tong University
Feilong Tang, Mohamed bin Zayed University of Artificial Intelligence
Wei Shen, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i10.37789

Abstract

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images based on textual descriptions, even for categories beyond predefined closed sets. While vision-language foundation models like CLIP are widely used for this task, fine-tuning them for pixel-level predictions often compromises their generalization capabilities. To address this, we propose a novel fine-tuning strategy, CP-CLIP, which generates customized parameters for CLIP without sacrificing its generalization. Our method employs a customized parameter generator that produces newly added parameters based on random noise, using local visual features from CLIP's image encoder as conditions, enabling generalization to new images from unseen scenarios. Additionally, we introduce an orthogonal adaptation technique to ensure the update direction is orthogonal to the pre-trained weights, largely preserving the initial generalization ability. Extensive experiments demonstrate that CP-CLIP achieves state-of-the-art performance across multiple benchmarks in open-vocabulary semantic segmentation.
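
The abstract does not say how the orthogonality constraint is enforced. As a minimal sketch of one possible reading, the snippet below removes from a candidate update its component along a frozen weight matrix under the Frobenius inner product, so that the resulting update is orthogonal to the pre-trained weights. The function names, the matrix shapes, and the use of Gaussian noise as a stand-in for the generator's output are all assumptions made for illustration; this is not the authors' implementation.

```python
import random

def frobenius_inner(a, b):
    """Frobenius inner product of two equally shaped matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def project_orthogonal(delta, weight):
    """Subtract from `delta` its component along `weight` (hypothetical helper)."""
    coeff = frobenius_inner(delta, weight) / frobenius_inner(weight, weight)
    return [[d - coeff * w for d, w in zip(row_d, row_w)]
            for row_d, row_w in zip(delta, weight)]

random.seed(0)
# Stand-in for a frozen pre-trained weight matrix (assumed 4 x 3).
weight = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
# Stand-in for a generated update; here it is plain Gaussian noise.
raw_update = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]

delta = project_orthogonal(raw_update, weight)
# Adapted weights: frozen weights plus the orthogonalized update.
adapted = [[w + d for w, d in zip(row_w, row_d)]
           for row_w, row_d in zip(weight, delta)]
# Inner product between the update and the frozen weights (zero up to rounding).
residual = frobenius_inner(delta, weight)
```

Under this reading, the projected update cannot rescale the pre-trained weights along their own direction, which is one simple way an update could avoid overwriting what the weights already encode.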
