Stationary and Clustering Transformer Hashing for Cross-modal Retrieval
DOI:
https://doi.org/10.1609/aaai.v40i33.39994
Abstract
Unsupervised cross-modal hashing has gained significant attention for efficient retrieval across heterogeneous modalities by encoding data into unified binary representations, offering low storage cost and fast response. However, existing methods remain limited in bridging the cross-modal semantic gap and in capturing fine-grained global semantic structures without explicit labels. In this paper, we propose an innovative unsupervised Stationary distribution and soft Clustering Transformer Hashing approach for cross-modal retrieval, denoted SCTH. First, a Transformer-based modality fusion encoder extracts rich cross-modal semantic representations and is integrated with contrastive hashing to minimize the semantic gap. To enhance inter-modal alignment, a pseudo-classifier clustering module with an entropy-regularized contrastive loss is introduced, ensuring balanced and diverse cluster assignments in unsupervised settings. Additionally, a Markovian stationary distribution strategy stabilizes the feature representations by mitigating the interference of noise and outliers. Comprehensive experiments on the MIRFlickr, NUS-WIDE, and IAPR-TC12 datasets show that SCTH outperforms state-of-the-art hashing methods in cross-modal retrieval tasks, demonstrating superior generalization performance.
Published
2026-03-14
How to Cite
Yang, Z., Liu, Y., Huang, Y., & Li, Y. (2026). Stationary and Clustering Transformer Hashing for Cross-modal Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 27728–27737. https://doi.org/10.1609/aaai.v40i33.39994
Issue
Section
AAAI Technical Track on Machine Learning X