Stationary and Clustering Transformer Hashing for Cross-modal Retrieval

Zhan Yang; Yiran Liu; Youyuan Huang; Yinan Li

doi:10.1609/aaai.v40i33.39994

Authors

Zhan Yang, Central South University
Yiran Liu, Central South University
Youyuan Huang, Central South University
Yinan Li, Central South University

Abstract

Unsupervised cross-modal hashing has gained significant attention for efficient retrieval across heterogeneous modalities: by encoding data into unified binary representations, it offers low storage cost and fast response. However, existing methods still struggle to bridge the cross-modal semantic gap and to capture fine-grained global semantic structures without explicit labels. In this paper, we propose an unsupervised Stationary distribution and soft Clustering Transformer Hashing approach for cross-modal retrieval, denoted SCTH. First, a Transformer-based modality fusion encoder extracts rich cross-modal semantic representations and is integrated with contrastive hashing to reduce the semantic gap. To enhance inter-modal alignment, we present a pseudo-classifier clustering module with an entropy-regularized contrastive loss, ensuring balanced and diverse cluster assignments in unsupervised settings. Additionally, a Markovian stationary distribution strategy stabilizes the feature representations by mitigating the interference of noise and outliers. Comprehensive experiments on the MIRFlickr, NUS-WIDE, and IAPR-TC12 datasets show that SCTH outperforms state-of-the-art hashing methods in cross-modal retrieval tasks and demonstrates superior generalization performance.
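
The abstract does not say how the Markovian stationary distribution is constructed, so the following is only a minimal sketch of the general idea, not SCTH's actual formulation. It assumes a transition matrix built from a row-wise softmax over cosine similarities between features, and finds its stationary distribution by power iteration. All function names and the `temperature` parameter are hypothetical, chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row, temperature):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp((value - peak) / temperature) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(features, temperature=0.5):
    """Row-stochastic matrix (assumed construction): each row is a
    softmax over the cosine similarities from one sample to all samples."""
    return [
        softmax([cosine(f, g) for g in features], temperature)
        for f in features
    ]


def stationary_distribution(P, max_iters=1000, tol=1e-12):
    """Power iteration: repeat pi <- pi P until pi stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


# Toy features: two similar samples and one outlier pointing the other way.
features = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.05]]
P = transition_matrix(features)
pi = stationary_distribution(P)
```

In this toy setup the outlier receives the smallest stationary mass, because few similarity-weighted transitions lead to it. That is one plausible reading of how a stationary distribution could down-weight noisy samples; how SCTH uses the distribution during training is not stated in the abstract.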
