An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-level Structural Information

Zejun Li; Zhongyu Wei; Zhihao Fan; Haijun Shan; Xuanjing Huang

doi:10.1609/aaai.v35i15.17573

Authors

Zejun Li School of Data Science, Fudan University, China
Zhongyu Wei School of Data Science, Fudan University, China Research Institute of Intelligent and Complex Systems, Fudan University, China
Zhihao Fan School of Data Science, Fudan University, China
Haijun Shan Zhejiang Lab, China
Xuanjing Huang School of Computer Science, Fudan University

DOI:

https://doi.org/10.1609/aaai.v35i15.17573

Keywords:

Language Grounding & Multi-modal NLP

Abstract

In this paper, we focus on the problem of unsupervised image-sentence matching. Existing research uses document-level structural information to sample positive and negative instances for model training. Although this approach achieves positive results, it introduces a sampling bias and fails to distinguish instances with high semantic similarity. To alleviate the bias, we propose a new sampling strategy that selects additional intra-document image-sentence pairs as positive or negative samples. Furthermore, to recognize the complex patterns in intra-document samples, we propose a Transformer-based model that captures fine-grained features and implicitly constructs a graph for each document, in which concepts are introduced to bridge the representation learning of images and sentences in the context of that document. Experimental results show that our approach is effective in alleviating the bias and learning well-aligned multimodal representations.
