Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching

Xuanpu Zhao; Dianmo Sheng; Zhentao Tan; Zhiwei Zhao; Tao Gong; Qi Chu; Bin Liu; Nenghai Yu

doi:10.1609/aaai.v39i10.33137

Authors

Xuanpu Zhao: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Dianmo Sheng: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Zhentao Tan: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Zhiwei Zhao: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Tao Gong: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Qi Chu: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Bin Liu: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
Nenghai Yu: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism

DOI:

https://doi.org/10.1609/aaai.v39i10.33137

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment images of arbitrary categories specified by class labels. Previous approaches rely on extensive image-text pairs or dense semantic annotations. Recent training-free methods attempt to overcome these limitations by building semantic prototypes in a construction stage and performing image-to-image matching (i.e., prototype matching) at test time. However, these methods often struggle to capture the visual characteristics of categories and fail to exploit local features during prototype matching. To address these problems, we propose a novel training-free framework for OVSS that constructs diverse prototypes and performs fine-grained sub-region matching. Specifically, in the construction stage, our method uses Large Language Models (LLMs) to guide support image generation with descriptions of different category attributes, and employs coarse-fine clustering to obtain diverse and robust part-level prototypes. During testing, we propose a sub-region matching method that assigns part-level prototypes to sub-regions via optimal transport, so that local image features are fully utilized across the part-level prototypes. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art performance and outperforms previous methods across five datasets.
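
To make the optimal-transport step concrete, the following is a minimal sketch of how part-level prototypes could be softly assigned to sub-region features. It is not the authors' implementation: the abstract does not specify the cost function, the marginals, or the solver, so the cosine cost, the uniform marginals, the entropic Sinkhorn iterations, and the names `sinkhorn_assign`, `eps`, and `n_iters` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_assign(sub_regions, prototypes, eps=0.1, n_iters=200):
    """Entropic optimal transport between sub-region features and prototypes.

    sub_regions: (n, d) array, one feature vector per image sub-region.
    prototypes:  (m, d) array, one feature vector per part-level prototype.
    Returns the (n, m) transport plan and a scalar matching score.
    Assumptions (not from the paper): cosine cost, uniform marginals.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    s = sub_regions / np.linalg.norm(sub_regions, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - s @ p.T                      # (n, m), values in [0, 2]

    n, m = cost.shape
    a = np.full(n, 1.0 / n)                   # uniform mass over sub-regions
    b = np.full(m, 1.0 / m)                   # uniform mass over prototypes

    kernel = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):                  # Sinkhorn scaling updates
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    plan = u[:, None] * kernel * v[None, :]   # soft assignment matrix
    # Similarity weighted by transported mass serves as a matching score.
    score = float(np.sum(plan * (1.0 - cost)))
    return plan, score


# Toy usage: three orthogonal prototypes matched against themselves.
toy = np.eye(3)
plan, score = sinkhorn_assign(toy, toy)
hard_assignment = plan.argmax(axis=1)         # prototype index per sub-region
```

In this sketch each sub-region receives mass from the prototypes it resembles most, while the uniform prototype marginal encourages every part-level prototype to be used; a per-category score of this kind could then be compared across categories.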
