Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching

Authors

  • Huatian Zhang, University of Science and Technology of China
  • Lei Zhang, University of Science and Technology of China
  • Kun Zhang, University of Science and Technology of China
  • Zhendong Mao, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v38i7.28538

Keywords:

CV: Language and Vision, ML: Multimodal Learning, ML: Causal Learning

Abstract

Image-text matching, a fundamental task in multimodal intelligence, bridges vision and language. Its key challenge lies in capturing visual-semantic relevance. Fine-grained semantic interactions arise from fragment alignments between image regions and text words. However, not all fragments contribute to image-text relevance, and many existing methods are devoted to mining the vital ones so as to measure relevance accurately. How well an image and a text relate depends on the degree of semantic sharing between them. Treating this degree as an effect and fragments as its possible causes, we define the causes indispensable to generating the degree as necessary undertakers, i.e., if any of them had not occurred, the relevance would no longer hold. In this paper, we revisit image-text matching from the causal view and uncover inherent causal properties of relevance generation. We then propose a novel theoretical prototype that estimates the probability-of-necessity of fragments, PN_f, for the degree of semantic sharing by means of causal inference, and further design a Necessary Undertaker Identification Framework (NUIF) for image-text matching, which explicitly formalizes each fragment's contribution to image-text relevance by modeling PN_f in two ways. Extensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks.
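For background, the abstract's notion of a "necessary undertaker" (a cause without which the effect would not occur) mirrors Pearl's standard probability of necessity from causal inference. A minimal sketch of that definition, which PN_f presumably adapts to fragments (the binding of X to a fragment's occurrence and Y to image-text relevance is our illustrative reading, not the paper's exact formulation):

```latex
% Pearl's probability of necessity: given that cause X = x occurred and
% effect Y = y was observed, the probability that Y would have been y'
% (i.e., the effect absent) under the counterfactual intervention X = x'.
\mathrm{PN} \;=\; P\!\left(Y_{x'} = y' \;\middle|\; X = x,\; Y = y\right)
```

Intuitively, a fragment with high PN_f is one whose removal would, counterfactually, invalidate the observed image-text relevance.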

Published

2024-03-24

How to Cite

Zhang, H., Zhang, L., Zhang, K., & Mao, Z. (2024). Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7105-7114. https://doi.org/10.1609/aaai.v38i7.28538

Section

AAAI Technical Track on Computer Vision VI