Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search
DOI:
https://doi.org/10.1609/aaai.v38i12.29280
Keywords:
ML: Multimodal Learning, CV: Language and Vision
Abstract
Deep cross-modal hashing provides an effective and efficient unified representation learning solution for cross-modal search. However, existing methods neglect the implicit fine-grained multimodal knowledge relations between modalities, such as when an image contains information that is not directly described in the paired text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason over the multi-modal knowledge graph and fully learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings via a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic-consistency-preserving power. With the joint training of intra-modal and inter-modal contrast, both the invariant and the modality-specific information of the different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.
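The abstract does not specify the training objective; a common instantiation of the inter-modal contrastive learning it describes treats matched image-text pairs in a batch as positives and all other pairings as negatives, applied to continuously relaxed hash codes. The sketch below is a generic symmetric InfoNCE loss over tanh-relaxed codes under these assumptions, not the exact CMGCH objective; all function names and the temperature value are illustrative.

```python
import numpy as np

def relaxed_hash(z):
    # Continuous relaxation of binary codes: tanh maps features into (-1, 1);
    # at inference time, sign(z) would give the binary hash code.
    return np.tanh(z)

def inter_modal_info_nce(img_feats, txt_feats, tau=0.1):
    """Symmetric InfoNCE over relaxed hash codes.

    Matched image-text pairs (same row index) are positives; all other
    rows in the batch serve as negatives. Illustrative only, not the
    published CMGCH loss.
    """
    h_i = relaxed_hash(img_feats)
    h_t = relaxed_hash(txt_feats)
    # Cosine-similarity matrix, scaled by temperature tau.
    h_i = h_i / np.linalg.norm(h_i, axis=1, keepdims=True)
    h_t = h_t / np.linalg.norm(h_t, axis=1, keepdims=True)
    sim = h_i @ h_t.T / tau

    def nll_diag(s):
        # Negative log-likelihood of the diagonal (positive) entries
        # under a row-wise softmax, with max-subtraction for stability.
        s = s - s.max(axis=1, keepdims=True)
        log_p = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (nll_diag(sim) + nll_diag(sim.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = img + 0.05 * rng.normal(size=(8, 16))  # nearly aligned pairs
loss_aligned = inter_modal_info_nce(img, txt)
loss_random = inter_modal_info_nce(img, rng.normal(size=(8, 16)))
# Well-aligned pairs should incur a lower contrastive loss than random ones.
print(loss_aligned, loss_random)
```

Minimizing such a loss jointly in both directions is what lets the unified hash space preserve semantic consistency across modalities while the intra-modal counterpart (the same form applied within one modality's augmented views) retains modality-specific structure.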
Published
2024-03-24
How to Cite
Liang, M., Du, J., Liang, Z., Xing, Y., Huang, W., & Xue, Z. (2024). Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 13744-13753. https://doi.org/10.1609/aaai.v38i12.29280
Section
AAAI Technical Track on Machine Learning III