Semantic-Augmented Image Clustering via Adaptive Multi-Modal Collaboration

Authors

  • Xiaohan Zhang, Nanjing University
  • Chao Zhang, Nanjing University
  • Deng Xu, Nanjing University
  • Hong YU, Chongqing University of Posts and Telecommunications
  • Chunlin Chen, Nanjing University
  • Huaxiong Li, Nanjing University

DOI:

https://doi.org/10.1609/aaai.v40i33.40073

Abstract

Image clustering is a fundamental task in unsupervised visual learning. While recent self-supervised methods have explored various pretext tasks to generate supervision signals for clustering, they typically depend exclusively on raw images, resulting in insufficient supervision signals that are inherently constrained by limited visual semantics. In this paper, we propose a novel Semantic-Augmented image Clustering (SAC) method, which transcends the inherent limitations of purely visual representations through the integration of external knowledge. Specifically, SAC utilizes Vision-Language pre-trained Models (VLMs) to flexibly generate textual descriptions for each image, providing external semantic cues to supplement the visual information. By integrating both visual and textual information, SAC achieves image clustering through a multi-modal learning framework. To mitigate the negative impact of inaccurate textual information, SAC designs an uncertainty-driven adaptive weighting mechanism that explores both intra-modal and inter-modal neighborhood structures, and incorporates the adaptive weights into intra-modal and inter-modal contrastive learning, which improves the robustness against noisy image-text correspondences. Experiments on several popular datasets demonstrate the superiority of SAC compared to state-of-the-art methods.
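
As a rough illustration of how adaptive weights might enter an image-text contrastive loss, the following minimal sketch down-weights pairs whose neighborhoods disagree across modalities. It is not the authors' implementation: the abstract does not specify the uncertainty measure or the loss, so the Jaccard-overlap weight, the InfoNCE form, and all names (`neighbor_agreement_weights`, `weighted_info_nce`, `k`, `temperature`) are assumptions made for this example, which also covers only the inter-modal case.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def knn_indices(z, k):
    """Indices of the k most similar rows for each row (self excluded)."""
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def neighbor_agreement_weights(img, txt, k=5):
    """Hypothetical per-pair weight: Jaccard overlap between the k-NN set
    in image space and the k-NN set in text space. A low overlap is read
    as a sign that the generated text may not match the image."""
    nn_img, nn_txt = knn_indices(img, k), knn_indices(txt, k)
    w = np.empty(len(img))
    for i in range(len(img)):
        a, b = set(nn_img[i]), set(nn_txt[i])
        w[i] = len(a & b) / len(a | b)
    return w

def weighted_info_nce(img, txt, weights, temperature=0.1):
    """Image-to-text InfoNCE in which each positive pair is scaled by its weight."""
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)  # loss of each matched (image_i, text_i) pair
    return float((weights * per_pair).sum() / (weights.sum() + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(32, 16)))  # stand-in image embeddings
    txt = l2_normalize(rng.normal(size=(32, 16)))  # stand-in text embeddings
    w = neighbor_agreement_weights(img, txt, k=5)
    print(weighted_info_nce(img, txt, w))
```

In this sketch a pair whose image-space and text-space neighbors coincide keeps full weight, while a pair with disjoint neighborhoods contributes nothing to the loss. An intra-modal term could reuse the same weights, though how SAC combines the two is not stated in the abstract.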

Published

2026-03-14

How to Cite

Zhang, X., Zhang, C., Xu, D., YU, H., Chen, C., & Li, H. (2026). Semantic-Augmented Image Clustering via Adaptive Multi-Modal Collaboration. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28437–28445. https://doi.org/10.1609/aaai.v40i33.40073

Section

AAAI Technical Track on Machine Learning X