Noise-Aware Image Captioning with Progressively Exploring Mismatched Words

Authors

  • Zhongtian Fu, Nanjing University of Science and Technology, Nanjing 210094, China
  • Kefei Song, Nanjing University of Science and Technology, Nanjing 210094, China
  • Luping Zhou, The University of Sydney, Sydney 2052, Australia
  • Yang Yang, Nanjing University of Science and Technology, Nanjing 210094, China

DOI:

https://doi.org/10.1609/aaai.v38i11.29097

Keywords:

ML: Multimodal Learning, CV: Language and Vision

Abstract

Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large number of image-text pairs required for training is usually sourced from the internet due to the cost of manual annotation, which introduces noise in the form of mismatched image-text relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words so as to make full use of the trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) that adaptively mitigates the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of the current prediction by incorporating the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels, considering both the reliability of the original word-labels and model convergence, to periodically correct mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Captions datasets validate the effectiveness of our method in various noisy scenarios.
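The abstract's core idea can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it scores each word-label by combining the model's prediction certainty for that word (a stand-in for inter-modal representativeness) with the average confidence of subsequent predictions (a stand-in for intra-modal informativeness), then replaces low-reliability labels with the model's own predictions to form pseudo-word-labels. The function names, the multiplicative combination, and the threshold `tau` are all illustrative choices.

```python
import numpy as np

def word_reliability(probs, gt_ids):
    """Toy per-word reliability score (illustrative, not the paper's formula).

    probs:  (T, V) softmax outputs of the captioner at each timestep
    gt_ids: (T,)   original (possibly mismatched) word labels
    """
    T = len(gt_ids)
    # "Inter-modal representativeness": certainty the model assigns
    # to the labeled word at each timestep.
    inter = probs[np.arange(T), gt_ids]
    # "Intra-modal informativeness": mean confidence of the subsequent
    # predictions, amplifying words whose continuation is generated well.
    max_conf = probs.max(axis=1)
    intra = np.array([max_conf[t + 1:].mean() if t + 1 < T else max_conf[t]
                      for t in range(T)])
    return inter * intra

def pseudo_labels(probs, gt_ids, tau=0.3):
    """Keep reliable word-labels; replace the rest with model predictions."""
    r = word_reliability(probs, gt_ids)
    pred_ids = probs.argmax(axis=1)
    return np.where(r >= tau, gt_ids, pred_ids)

# Toy example: 3 timesteps, vocabulary of 3 words; the last label disagrees
# with a confident model prediction and gets corrected.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8,  0.1 ],
                  [0.2, 0.2,  0.6 ]])
gt = np.array([0, 1, 0])
print(pseudo_labels(probs, gt))  # first two labels kept, third replaced by 2
```

In the actual method, this correction is applied periodically and is additionally gated by model convergence, so early unreliable predictions do not overwrite labels prematurely.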

Published

2024-03-24

How to Cite

Fu, Z., Song, K., Zhou, L., & Yang, Y. (2024). Noise-Aware Image Captioning with Progressively Exploring Mismatched Words. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12091-12099. https://doi.org/10.1609/aaai.v38i11.29097

Section

AAAI Technical Track on Machine Learning II