Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

Authors

  • Huatian Zhang, University of Science and Technology of China
  • Zhendong Mao, University of Science and Technology of China
  • Kun Zhang, University of Science and Technology of China
  • Yongdong Zhang, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v36i3.20235

Keywords:

Computer Vision (CV)

Abstract

Image-text matching bridges vision and language and is a crucial task in multi-modal intelligence. The key challenge is how to measure image-text relevance accurately as matching evidence. Most existing works aggregate the local semantic similarities of matched region-word pairs into the overall relevance, and they typically assume that all matched pairs are equally reliable. However, a region-word pair that is locally matched across modalities may still be inconsistent or unreliable from the global perspective of the image and text, resulting in inaccurate relevance measurement. In this paper, we propose a novel Cross-Modal Confidence-Aware Network that infers a matching confidence indicating the reliability of each matched region-word pair and combines it with the local semantic similarities to refine the relevance measurement. Specifically, we first calculate the matching confidence as the relevance between the semantics of an image region and the complete semantics described in the image, with the text as a bridge. Further, to express the region semantics more richly, we extend each region to its visual context in the image. Then, the local semantic similarities are weighted by the inferred confidence to filter out unreliable matched pairs during aggregation. Comprehensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks.
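The weighting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the function names, the use of a normalized weighted average, and the example values are all assumptions chosen only to show how a per-pair confidence could down-weight unreliable region-word matches during aggregation.

```python
def uniform_relevance(local_sims):
    """Baseline: treat every matched region-word pair as equally reliable."""
    return sum(local_sims) / len(local_sims)


def confidence_weighted_relevance(local_sims, confidences):
    """Aggregate local similarities, weighting each pair by its confidence.

    Hypothetical helper: a normalized weighted average is assumed here;
    the paper may combine confidence and similarity differently.
    """
    if len(local_sims) != len(confidences):
        raise ValueError("local_sims and confidences must have equal length")
    total = sum(confidences)
    if total == 0:
        # No pair is trusted, so report no relevance evidence.
        return 0.0
    weighted = sum(s * c for s, c in zip(local_sims, confidences))
    return weighted / total


# Illustrative values only: three matched pairs, the last judged unreliable.
sims = [0.9, 0.8, 0.7]
conf = [1.0, 1.0, 0.0]
print(uniform_relevance(sims))
print(confidence_weighted_relevance(sims, conf))
```

With a confidence of zero, the third pair no longer contributes to the aggregated score, whereas the uniform baseline counts it in full.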

Published

2022-06-28

How to Cite

Zhang, H., Mao, Z., Zhang, K., & Zhang, Y. (2022). Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 3262-3270. https://doi.org/10.1609/aaai.v36i3.20235

Section

AAAI Technical Track on Computer Vision III