Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation

Authors

  • Wenhao Xu School of Artificial Intelligence, Beijing University of Posts and Telecommunications
  • Rongtao Xu State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Changwei Wang State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Shibiao Xu School of Artificial Intelligence, Beijing University of Posts and Telecommunications
  • Li Guo School of Artificial Intelligence, Beijing University of Posts and Telecommunications
  • Man Zhang School of Artificial Intelligence, Beijing University of Posts and Telecommunications
  • Xiaopeng Zhang State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v38i6.28456

Keywords:

CV: Language and Vision, CV: Large Vision Models, CV: Multi-modal Vision

Abstract

Recently, CLIP has found practical utility in pixel-level zero-shot segmentation tasks. The current landscape features two-stage methodologies hampered by intricate pipelines and high computational cost. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Tuning (VPT) to preserve CLIP's generalization capacity, they still fall short of fully harnessing CLIP's potential for pixel-level delineation of unseen classes and precise pixel predictions. To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from the image level to the pixel level. Specifically, we first introduce Spectral Prompt Tuning (SPT), which injects spectral prompts into the shallow layers of the CLIP visual encoder to capture the structural intricacies of images, thereby enhancing comprehension of unseen classes. We then introduce the Spectral Guided Decoder (SGD), which utilizes both high- and low-frequency information to steer the network's spatial focus toward more discriminative classification features, enabling precise pixel-level predictions. Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches: it performs well across all classes and particularly excels at handling unseen classes.
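The abstract's spectral components (SPT and SGD) both rest on separating a feature map into low- and high-frequency parts. The sketch below illustrates that decomposition step only, using an FFT low-pass mask in NumPy; the function name `spectral_split`, the circular mask, and the `cutoff` parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spectral_split(feat, cutoff=0.25):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    Illustrative sketch: a circular low-pass mask in the shifted 2-D
    spectrum keeps frequencies within `cutoff` of the Nyquist radius;
    the remainder is the high-frequency component.
    """
    C, H, W = feat.shape
    # Shift the spectrum so low frequencies sit at the centre.
    spec = np.fft.fftshift(np.fft.fft2(feat), axes=(-2, -1))
    # Distance of each spectral bin from the centre.
    ys = np.arange(H, dtype=np.float64)[:, None] - H / 2
    xs = np.arange(W, dtype=np.float64)[None, :] - W / 2
    radius = cutoff * (min(H, W) / 2)
    mask = (np.sqrt(ys**2 + xs**2) <= radius).astype(np.float64)
    # Invert each masked spectrum back to the spatial domain.
    inv = lambda s: np.fft.ifft2(np.fft.ifftshift(s, axes=(-2, -1))).real
    return inv(spec * mask), inv(spec * (1.0 - mask))
```

Because the two masks partition the spectrum, the low and high components sum back to the original feature map (up to floating-point error), which makes the split a lossless way to expose structural (high-frequency) cues to the prompts.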

Published

2024-03-24

How to Cite

Xu, W., Xu, R., Wang, C., Xu, S., Guo, L., Zhang, M., & Zhang, X. (2024). Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6369-6377. https://doi.org/10.1609/aaai.v38i6.28456

Issue

Section

AAAI Technical Track on Computer Vision V