Noise-Aware Image Captioning with Progressively Exploring Mismatched Words

Authors

  • Zhongtian Fu, Nanjing University of Science and Technology, Nanjing 210094, China
  • Kefei Song, Nanjing University of Science and Technology, Nanjing 210094, China
  • Luping Zhou, The University of Sydney, Sydney 2052, Australia
  • Yang Yang, Nanjing University of Science and Technology, Nanjing 210094, China

DOI:

https://doi.org/10.1609/aaai.v38i11.29097

Keywords:

ML: Multimodal Learning, CV: Language and Vision

Abstract

Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large number of image-text pairs required for training is usually sourced from the internet due to the cost of manual annotation, which introduces noise in the form of mismatched image-text relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words so as to make full use of the trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) that adaptively mitigates the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of the current prediction by incorporating the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels, considering both the reliability of the original word-labels and model convergence, to periodically correct mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Captions datasets validate the effectiveness of our method in various noisy scenarios.
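The abstract's core idea can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it scores each word-label by combining the model's prediction certainty for that word (a stand-in for inter-modal representativeness) with the average confidence of subsequent predictions (a stand-in for intra-modal informativeness), then replaces low-reliability labels with the model's own predictions to form pseudo-word-labels. The function names, the multiplicative combination, and the threshold `tau` are all illustrative choices.

```python
import numpy as np

def word_reliability(probs, gt_ids):
    """Toy per-word reliability score (illustrative, not the paper's formula).

    probs:  (T, V) softmax outputs of the captioner at each timestep
    gt_ids: (T,)   original (possibly mismatched) word labels
    """
    T = len(gt_ids)
    # "Inter-modal representativeness": certainty the model assigns
    # to the labeled word at each timestep.
    inter = probs[np.arange(T), gt_ids]
    # "Intra-modal informativeness": mean confidence of the subsequent
    # predictions, amplifying words whose continuation is generated well.
    max_conf = probs.max(axis=1)
    intra = np.array([max_conf[t + 1:].mean() if t + 1 < T else max_conf[t]
                      for t in range(T)])
    return inter * intra

def pseudo_labels(probs, gt_ids, tau=0.3):
    """Keep reliable word-labels; replace the rest with model predictions."""
    r = word_reliability(probs, gt_ids)
    pred_ids = probs.argmax(axis=1)
    return np.where(r >= tau, gt_ids, pred_ids)

# Toy example: 3 timesteps, vocabulary of 3 words; the last label disagrees
# with a confident model prediction and gets corrected.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8,  0.1 ],
                  [0.2, 0.2,  0.6 ]])
gt = np.array([0, 1, 0])
print(pseudo_labels(probs, gt))  # first two labels kept, third replaced by 2
```

In the actual method, this correction is applied periodically and is additionally gated by model convergence, so early unreliable predictions do not overwrite labels prematurely.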

Published

2024-03-24

How to Cite

Fu, Z., Song, K., Zhou, L., & Yang, Y. (2024). Noise-Aware Image Captioning with Progressively Exploring Mismatched Words. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12091-12099. https://doi.org/10.1609/aaai.v38i11.29097

Section

AAAI Technical Track on Machine Learning II