Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search

Authors

  • Meiyu Liang, Beijing University of Posts and Telecommunications
  • Junping Du, Beijing University of Posts and Telecommunications
  • Zhengyang Liang, Beijing University of Posts and Telecommunications
  • Yongwang Xing, Beijing University of Posts and Telecommunications
  • Wei Huang, Beijing University of Posts and Telecommunications
  • Zhe Xue, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v38i12.29280

Keywords:

ML: Multimodal Learning, CV: Language and Vision

Abstract

Deep cross-modal hashing provides an effective and efficient unified representation learning solution for cross-modal search. However, existing methods neglect the implicit fine-grained multi-modal knowledge relations between modalities, such as when an image contains information that is not directly described in its paired text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multi-modal knowledge relations between image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason over the multi-modal knowledge graph and fully learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses global coarse-grained and local fine-grained embeddings via a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, yielding cross-modal unified representations with stronger discriminativeness and better preservation of semantic consistency. Through joint intra-modal and inter-modal contrastive training, both the modality-invariant and the modality-specific information of different modalities is maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms state-of-the-art methods.
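The abstract outlines three components: a multi-modal knowledge graph, a cross-modal graph contrastive attention network, and multi-granularity contrastive embedding learning that fuses coarse- and fine-grained features with multi-head attention before hashing. The PyTorch sketch below illustrates only the last two ingredients in a highly simplified form: attention-based fusion into a relaxed (tanh) hash code and a symmetric InfoNCE contrastive loss between paired image and text codes. All class names, dimensions, and the temperature value are illustrative assumptions made for this sketch, not the authors' released implementation.

# Minimal sketch of attention-based multi-granularity fusion plus
# inter-modal contrastive hashing; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityFusion(nn.Module):
    """Fuse a global (coarse-grained) embedding with local (fine-grained)
    region/token embeddings via multi-head attention, then map the fused
    representation to a relaxed hash code in [-1, 1]."""

    def __init__(self, dim=512, num_heads=8, hash_bits=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hash_head = nn.Linear(dim, hash_bits)

    def forward(self, global_emb, local_embs):
        # global_emb: (B, dim); local_embs: (B, N, dim), e.g. image regions
        # or text tokens produced by a modality-specific encoder.
        query = global_emb.unsqueeze(1)                       # (B, 1, dim)
        fused, _ = self.attn(query, local_embs, local_embs)   # (B, 1, dim)
        fused = fused.squeeze(1) + global_emb                 # residual fusion
        return torch.tanh(self.hash_head(fused))              # relaxed hash code


def info_nce(z_a, z_b, temperature=0.3):
    """Symmetric InfoNCE loss: matched (z_a[i], z_b[i]) pairs are positives,
    all other pairs in the batch are negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: random features stand in for encoder outputs on paired image/text.
B, N, D = 32, 16, 512
img_head, txt_head = MultiGranularityFusion(D), MultiGranularityFusion(D)
img_code = img_head(torch.randn(B, D), torch.randn(B, N, D))
txt_code = txt_head(torch.randn(B, D), torch.randn(B, N, D))

inter_loss = info_nce(img_code, txt_code)   # inter-modal alignment term
# An intra-modal term would contrast two augmented views of the same modality;
# binary codes are obtained at search time via sign(img_code) / sign(txt_code).
print(float(inter_loss))

The graph-based reasoning step and the knowledge-graph construction described in the abstract are omitted here; the sketch only conveys how fused embeddings can be contrasted and relaxed into hash codes.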

Published

2024-03-24

How to Cite

Liang, M., Du, J., Liang, Z., Xing, Y., Huang, W., & Xue, Z. (2024). Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 13744-13753. https://doi.org/10.1609/aaai.v38i12.29280

Issue

Vol. 38 No. 12 (2024)

Section

AAAI Technical Track on Machine Learning III