Semantic-Augmented Image Clustering via Adaptive Multi-Modal Collaboration

Xiaohan Zhang; Chao Zhang; Deng Xu; Hong YU; Chunlin Chen; Huaxiong Li

doi:10.1609/aaai.v40i33.40073

Authors

Xiaohan Zhang, Nanjing University
Chao Zhang, Nanjing University
Deng Xu, Nanjing University
Hong YU, Chongqing University of Posts and Telecommunications
Chunlin Chen, Nanjing University
Huaxiong Li, Nanjing University


Abstract

Image clustering is a fundamental task in unsupervised visual learning. While recent self-supervised methods have explored various pretext tasks to generate supervision signals for clustering, they typically depend exclusively on raw images, so their supervision signals are inherently constrained by limited visual semantics. In this paper, we propose a novel Semantic-Augmented image Clustering (SAC) method, which overcomes the limitations of purely visual representations by integrating external knowledge. Specifically, SAC uses Vision-Language pre-trained Models (VLMs) to flexibly generate textual descriptions for each image, providing external semantic cues that supplement the visual information. By integrating both visual and textual information, SAC performs image clustering within a multi-modal learning framework. To mitigate the negative impact of inaccurate textual information, SAC introduces an uncertainty-driven adaptive weighting mechanism that explores both intra-modal and inter-modal neighborhood structures and incorporates the resulting adaptive weights into intra-modal and inter-modal contrastive learning, improving robustness to noisy image-text correspondences. Experiments on several popular datasets demonstrate the superiority of SAC over state-of-the-art methods.
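The abstract does not give the weighting formula or the loss, so the following is only an illustrative sketch of the general idea of down-weighting unreliable image-text pairs in a contrastive objective. It is not the SAC formulation. Everything in it is an assumption made for illustration: the function names, the use of k-nearest-neighbor overlap between the two embedding spaces as a confidence proxy, the InfoNCE-style loss, and the values of `k` and `temperature`.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def knn_indices(z, k):
    """Indices of the k most similar other rows for each row of z."""
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # exclude each sample from its own neighbors
    return np.argsort(-sim, axis=1)[:, :k]


def neighbor_agreement_weights(img, txt, k=2):
    """Hypothetical confidence weight in [0, 1] for each image-text pair.

    Assumption: a pair is more trustworthy when the sample has similar
    neighbors in the image space and in the text space.
    """
    nn_img = knn_indices(img, k)
    nn_txt = knn_indices(txt, k)
    overlap = np.array([len(set(a) & set(b)) for a, b in zip(nn_img, nn_txt)])
    return overlap / k


def weighted_cross_modal_infonce(img, txt, weights, temperature=0.5):
    """InfoNCE-style image-to-text loss, averaged with per-sample weights."""
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -np.diag(log_prob)  # positive pair is the matching index
    return float((weights * per_sample).sum() / max(weights.sum(), 1e-12))
```

Under these assumptions, a pair whose text embedding has neighbors unrelated to those of its image contributes little to the loss. The same per-sample weights could be applied to an intra-modal contrastive term in an analogous way.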
