Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching

Authors

  • Xuanpu Zhao, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Dianmo Sheng, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Zhentao Tan, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Zhiwei Zhao, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Tao Gong, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Qi Chu, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Bin Liu, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism
  • Nenghai Yu, School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; the CCCD Key Lab of Ministry of Culture and Tourism

DOI:

https://doi.org/10.1609/aaai.v39i10.33137

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment images into arbitrary categories specified by class labels. Previous approaches relied on extensive image-text pairs or dense semantic annotations; recent training-free methods attempt to overcome these limitations by constructing semantic prototypes in a construction stage and performing image-to-image matching (i.e., prototype matching) during testing. However, these methods often struggle to capture the visual characteristics of categories and fail to exploit local features during prototype matching. To address these problems, we propose a novel training-free framework for OVSS that constructs diverse prototypes and performs fine-grained sub-region matching. Specifically, in the construction stage, our method uses Large Language Models (LLMs) to guide support image generation with descriptions of different category attributes, and employs coarse-fine clustering to obtain diverse and robust part-level prototypes. During testing, we propose a sub-region matching method that assigns part-level prototypes to sub-regions using optimal transport, so that local image features are fully exploited across part-level prototypes. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art performance and outperforms previous methods across five datasets.
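As a rough illustration of the sub-region matching idea, the sketch below assigns part-level prototypes to sub-region features with entropy-regularised optimal transport (Sinkhorn iterations). It is not the authors' implementation: the abstract does not specify the cost, the marginals, or the solver, so the cosine-distance cost, the uniform marginals, the regularisation strength `eps`, and the function names `sinkhorn_plan` and `match_subregions` are all assumptions made for this example.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan between rows and columns of `cost`.

    Assumption: both marginals are uniform. The paper may weight
    sub-regions or prototypes differently.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass placed on each sub-region
    b = np.full(m, 1.0 / m)  # mass placed on each prototype
    K = np.exp(-cost / eps)  # Gibbs kernel, strictly positive
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows towards marginal a
        v = b / (K.T @ u)    # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def match_subregions(region_feats, prototypes, eps=0.1):
    """Match sub-region features (n, d) to part-level prototypes (m, d).

    Returns the transport plan (n, m) and a plan-weighted cosine
    similarity that could serve as a class score (hypothetical scoring).
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = r @ p.T                          # cosine similarity in [-1, 1]
    plan = sinkhorn_plan(1.0 - sim, eps)   # cost = cosine distance
    score = float((plan * sim).sum())
    return plan, score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(4, 8))   # 4 sub-regions, 8-dim features
    protos = rng.normal(size=(3, 8))    # 3 part-level prototypes
    plan, score = match_subregions(regions, protos)
    print(plan.shape, score)
```

Under these assumptions, each entry of the plan indicates how strongly a sub-region is matched to a prototype, and repeating the matching per category and comparing the resulting scores would give one plausible way to pick a label.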

Published

2025-04-11

How to Cite

Zhao, X., Sheng, D., Tan, Z., Zhao, Z., Gong, T., Chu, Q., … Yu, N. (2025). Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10474–10482. https://doi.org/10.1609/aaai.v39i10.33137

Section

AAAI Technical Track on Computer Vision IX