CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation

Authors

  • Zelin Peng, Shanghai Jiao Tong University
  • Zhengqin Xu, Shanghai Jiao Tong University
  • Feilong Tang, Mohamed bin Zayed University of Artificial Intelligence
  • Wei Shen, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i10.37789

Abstract

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images based on textual descriptions, even for categories beyond predefined closed sets. While vision-language foundation models like CLIP are widely used for this task, fine-tuning them for pixel-level predictions often compromises their generalization capabilities. To address this, we propose a novel fine-tuning strategy, CP-CLIP, which generates customized parameters for CLIP without sacrificing its generalization. Our method employs a customized parameter generator that produces newly added parameters based on random noise, using local visual features from CLIP's image encoder as conditions, enabling generalization to new images from unseen scenarios. Additionally, we introduce an orthogonal adaptation technique to ensure the update direction is orthogonal to the pre-trained weights, largely preserving the initial generalization ability. Extensive experiments demonstrate that CP-CLIP achieves state-of-the-art performance across multiple benchmarks in open-vocabulary semantic segmentation.
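
The abstract does not spell out how the orthogonality constraint is enforced, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes "orthogonal to the pre-trained weights" means that the update has no component in the column space of the pre-trained weight matrix, and it enforces that by projection. The function name orthogonalize_update, the matrix shapes, and the random stand-in matrices are all hypothetical.

```python
import numpy as np


def orthogonalize_update(weight, delta):
    """Remove from `delta` the component lying in the column space of `weight`.

    Assumption for illustration only: orthogonality is measured between the
    columns of the update and the columns of the pre-trained weight matrix.
    """
    # Reduced QR gives an orthonormal basis `q` for the column space of `weight`.
    q, _ = np.linalg.qr(weight)
    # Subtract the projection of `delta` onto that basis.
    return delta - q @ (q.T @ delta)


rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 3))  # stand-in for a pre-trained weight matrix
delta = rng.standard_normal((8, 3))   # stand-in for a generated parameter update

delta_orth = orthogonalize_update(weight, delta)

# Largest absolute entry of W^T (delta_orth); should be zero up to rounding error.
residual = np.abs(weight.T @ delta_orth).max()
print(f"max |W^T dW| after projection: {residual:.2e}")
```

Under this assumption, adding delta_orth to the weights leaves the projection of the result onto the original column space unchanged, which is one intuition for why an orthogonal update could interfere less with what the pre-trained weights already encode.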

Published

2026-03-14

How to Cite

Peng, Z., Xu, Z., Tang, F., & Shen, W. (2026). CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8394–8402. https://doi.org/10.1609/aaai.v40i10.37789

Section

AAAI Technical Track on Computer Vision VII