CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation
DOI:
https://doi.org/10.1609/aaai.v40i10.37789
Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images based on textual descriptions, even for categories beyond predefined closed sets. While vision-language foundation models like CLIP are widely used for this task, fine-tuning them for pixel-level predictions often compromises their generalization capabilities. To address this, we propose a novel fine-tuning strategy, CP-CLIP, which generates customized parameters for CLIP without sacrificing its generalization. Our method employs a customized parameter generator that produces newly added parameters based on random noise, using local visual features from CLIP's image encoder as conditions, enabling generalization to new images from unseen scenarios. Additionally, we introduce an orthogonal adaptation technique to ensure the update direction is orthogonal to the pre-trained weights, largely preserving the initial generalization ability. Extensive experiments demonstrate that CP-CLIP achieves state-of-the-art performance across multiple benchmarks in open-vocabulary semantic segmentation.
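The abstract does not say how the orthogonality constraint is enforced. As a rough illustration of the general idea only, the sketch below assumes the simplest possible reading: the component of an update that lies along the pre-trained weight matrix (under the Frobenius inner product) is removed before the update is added. The function names, the toy matrices, and the choice of a Frobenius projection are all assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch only: one simple way to make a weight update orthogonal
# to pre-trained weights. All names and values are hypothetical, not from the paper.

def frobenius_inner(a, b):
    """Frobenius inner product of two equally shaped matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def project_orthogonal(delta, w0):
    """Remove from `delta` its component along `w0`, so <result, w0>_F is zero."""
    denom = frobenius_inner(w0, w0)
    if denom == 0.0:
        # A zero weight matrix has no direction to project out; copy delta unchanged.
        return [row[:] for row in delta]
    scale = frobenius_inner(delta, w0) / denom
    return [[d - scale * w for d, w in zip(row_d, row_w)]
            for row_d, row_w in zip(delta, w0)]

w0 = [[1.0, 2.0], [0.0, 1.0]]      # stand-in for frozen pre-trained weights
delta = [[0.5, -1.0], [2.0, 3.0]]  # stand-in for a generated update

delta_orth = project_orthogonal(delta, w0)

# Adapted weights: frozen weights plus the orthogonalized update.
w_adapted = [[w + d for w, d in zip(row_w, row_d)]
             for row_w, row_d in zip(w0, delta_orth)]
```

In this toy setting the projected update has zero inner product with the frozen weights, so adding it does not rescale the pre-trained weights along their own direction. The method in the paper may define orthogonality differently, for example per layer or per subspace.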
Published
2026-03-14
How to Cite
Peng, Z., Xu, Z., Tang, F., & Shen, W. (2026). CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8394–8402. https://doi.org/10.1609/aaai.v40i10.37789
Issue
Section
AAAI Technical Track on Computer Vision VII