Prompting Multi-Modal Image Segmentation with Semantic Grouping

Authors

  • Qibin He, University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v38i3.27981

Keywords:

CV: Scene Analysis & Understanding, CV: Large Vision Models, CV: Multi-modal Vision, CV: Representation Learning for Vision, CV: Segmentation

Abstract

Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such a paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves state-of-the-art performance on various downstream multi-modal image segmentation tasks while training less than 1% of the model parameters.
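To make the general idea concrete, the sketch below illustrates prompt tuning with a frozen backbone: modality-specific ("uni-modal") prompt tokens and shared ("cross-modal") prompt tokens are the only trainable parameters besides a small head, so the trainable fraction stays well under 1%. This is a minimal, hypothetical PyTorch example; the class name GroupingPromptSegmenter, the tiny stand-in backbone, and all hyperparameters are illustrative assumptions, not the paper's actual GoPT implementation.

import torch
import torch.nn as nn


class GroupingPromptSegmenter(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_modalities=2,
                 prompts_per_modality=8, shared_prompts=8, num_classes=20):
        super().__init__()
        # Stand-in for an RGB pre-trained foundation model; kept frozen.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Modality-specific ("uni-modal") prompt tokens, one set per modality.
        self.uni_prompts = nn.Parameter(
            torch.randn(num_modalities, prompts_per_modality, dim) * 0.02)
        # Prompt tokens shared across modalities ("cross-modal").
        self.shared_prompts = nn.Parameter(torch.randn(shared_prompts, dim) * 0.02)
        # Lightweight per-token classification head (also trainable).
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens_per_modality):
        # tokens_per_modality: list of (B, N, dim) patch-token tensors, one per modality.
        outputs = []
        for m, tokens in enumerate(tokens_per_modality):
            b = tokens.shape[0]
            uni = self.uni_prompts[m].unsqueeze(0).expand(b, -1, -1)
            shared = self.shared_prompts.unsqueeze(0).expand(b, -1, -1)
            x = torch.cat([uni, shared, tokens], dim=1)
            x = self.backbone(x)
            # Discard prompt positions and classify the remaining patch tokens.
            outputs.append(self.head(x[:, uni.shape[1] + shared.shape[1]:]))
        return outputs


if __name__ == "__main__":
    model = GroupingPromptSegmenter()
    rgb = torch.randn(2, 196, 256)   # e.g. RGB patch tokens
    aux = torch.randn(2, 196, 256)   # e.g. depth or thermal patch tokens
    logits = model([rgb, aux])
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")

Running the script prints a trainable fraction of roughly 0.2% for this toy configuration, which is the kind of budget the abstract refers to; the paper's actual grouping of class tokens and prompt-sharing mechanism is more involved than this stand-in.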

Published

2024-03-24

How to Cite

He, Q. (2024). Prompting Multi-Modal Image Segmentation with Semantic Grouping. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2094-2102. https://doi.org/10.1609/aaai.v38i3.27981

Issue

Vol. 38 No. 3 (2024)

Section

AAAI Technical Track on Computer Vision II