Vision-guided Text Mining for Unsupervised Cross-modal Hashing with Community Similarity Quantization

Authors

  • Haozhi Fan, School of Engineering and Applied Science, University of Pennsylvania, USA
  • Yuan Cao, School of Computer Science and Technology, Ocean University of China, China

DOI:

https://doi.org/10.1609/aaai.v39i3.32290

Abstract

Cross-modal retrieval, an emerging field within multimedia research, has gained significant attention in recent years. Unsupervised cross-modal hashing methods are attractive because they capture latent relationships within the data without label supervision and produce compact hash codes for high search efficiency. However, the text modality has weaker representational power than the image modality, providing only weak guidance when constructing the joint similarity matrix. Moreover, most unsupervised cross-modal hashing methods are trained on pairwise similarities, resulting in a non-aggregating data distribution in the hash space. In this paper, we propose a novel Vision-guided Text Mining for Unsupervised Cross-modal Hashing via Community Similarity Quantization, termed VTM-UCH. Specifically, we first find the one-to-one correspondence between each word and each vision (image or object) based on the Contrastive Language-Image Pre-training (CLIP) model, and compute text similarities from the clustering of their corresponding visions. Then, we define fine-grained object-level image similarities and design the joint similarity matrix from the text and image similarities. Accordingly, we construct an undirected graph to compute communities as pseudo-centers and adjust the pairwise similarities to improve the hash code distribution. Experimental results on two common datasets demonstrate accuracy improvements over state-of-the-art baselines.
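The community-quantization step described in the abstract — thresholding a joint similarity matrix into an undirected graph, treating its communities as pseudo-centers, and pulling intra-community pairs closer — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual formulation: it uses connected components as a stand-in for community detection, and the threshold and boost values are hypothetical.

```python
import numpy as np

def communities_from_similarity(S, threshold=0.5):
    """Threshold the joint similarity matrix S into an undirected graph
    and return its connected components as communities (a simple stand-in
    for the paper's community/pseudo-center computation)."""
    n = S.shape[0]
    adj = S >= threshold          # undirected graph as a boolean adjacency matrix
    seen = [False] * n
    communities = []
    for i in range(n):
        if seen[i]:
            continue
        stack, comp = [i], []
        seen[i] = True
        while stack:              # depth-first traversal of one component
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if adj[u, v] and not seen[v]:
                    seen[v] = True
                    stack.append(v)
        communities.append(sorted(comp))
    return communities

def adjust_similarities(S, communities, boost=0.2):
    """Raise pairwise similarities within each community so that hash
    codes aggregate around the community pseudo-centers."""
    S_adj = S.copy()
    for comp in communities:
        for u in comp:
            for v in comp:
                if u != v:
                    S_adj[u, v] = min(1.0, S_adj[u, v] + boost)
    return S_adj
```

On a toy 3-point similarity matrix where points 0 and 1 are similar and point 2 is isolated, this yields the communities `[[0, 1], [2]]`, and the adjusted matrix increases only the (0, 1) pair, leaving cross-community similarities untouched.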

Published

2025-04-11

How to Cite

Fan, H., & Cao, Y. (2025). Vision-guided Text Mining for Unsupervised Cross-modal Hashing with Community Similarity Quantization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2843–2851. https://doi.org/10.1609/aaai.v39i3.32290

Section

AAAI Technical Track on Computer Vision II