Noise-Aware Image Captioning with Progressively Exploring Mismatched Words
DOI:
https://doi.org/10.1609/aaai.v38i11.29097
Keywords:
ML: Multimodal Learning, CV: Language and Vision
Abstract
Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. Because manual annotation is costly, the large number of image-text pairs required for training is usually sourced from the internet, which introduces noise in the form of mismatched image-text pairs that disrupts the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to identify the mismatched words at a fine granularity, so as to make the most of the trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) that adaptively mitigates the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of the current prediction by incorporating the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels that account for the reliability of the original word-labels and model convergence, in order to periodically coordinate mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Captions datasets validate the effectiveness of our method in various noisy scenarios.
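The word-level idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `horizon` parameter, the equal-weight averaging, and the linear blending rule are all assumptions chosen for illustration. The sketch scores each caption word by the probability the model assigns to it (a stand-in for prediction certainty) together with the average certainty of the next few words (a stand-in for the quality of subsequent generation), then builds soft pseudo-word-labels that lean on the model's prediction when the given word looks unreliable.

```python
from typing import List


def word_reliability(probs: List[List[float]], labels: List[int],
                     horizon: int = 2) -> List[float]:
    """Hypothetical per-word reliability score in [0, 1] (illustrative only)."""
    # Certainty proxy: probability the model assigns to each given word-label.
    certainty = [probs[t][labels[t]] for t in range(len(labels))]
    scores = []
    for t in range(len(labels)):
        # Follow-up proxy: mean certainty over the next `horizon` words.
        # For the last word there is no follow-up, so reuse its own certainty.
        future = certainty[t + 1:t + 1 + horizon]
        follow = sum(future) / len(future) if future else certainty[t]
        # Assumed combination: plain average of the two proxies.
        scores.append(0.5 * (certainty[t] + follow))
    return scores


def pseudo_word_labels(probs: List[List[float]], labels: List[int],
                       reliability: List[float]) -> List[List[float]]:
    """Blend one-hot word-labels with model predictions, weighted by reliability."""
    targets = []
    for dist, y, r in zip(probs, labels, reliability):
        # r close to 1 keeps the original label; r close to 0 trusts the model.
        targets.append([r * (1.0 if k == y else 0.0) + (1.0 - r) * p
                        for k, p in enumerate(dist)])
    return targets


# Toy caption of three words over a three-word vocabulary.
# The second word-label (index 1) disagrees with the model's prediction.
probs = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
labels = [0, 1, 0]
scores = word_reliability(probs, labels)
targets = pseudo_word_labels(probs, labels, scores)
```

In this toy input the second word receives the lowest reliability, so its pseudo-label shifts most of its mass toward the model's prediction, while the other two words stay close to their original labels. A real system would compute the distributions from a captioning model and would also need a schedule tied to model convergence, which this sketch omits.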
Published
2024-03-24
How to Cite
Fu, Z., Song, K., Zhou, L., & Yang, Y. (2024). Noise-Aware Image Captioning with Progressively Exploring Mismatched Words. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12091-12099. https://doi.org/10.1609/aaai.v38i11.29097
Issue
Section
AAAI Technical Track on Machine Learning II