NLIP: Noise-Robust Language-Image Pre-training

Authors

  • Runhui Huang, Shenzhen Campus of Sun Yat-sen University
  • Yanxin Long, Shenzhen Campus of Sun Yat-sen University
  • Jianhua Han, Huawei Noah's Ark Lab
  • Hang Xu, Huawei Noah's Ark Lab
  • Xiwen Liang, Shenzhen Campus of Sun Yat-sen University
  • Chunjing Xu, Huawei Noah's Ark Lab
  • Xiaodan Liang, Shenzhen Campus of Sun Yat-sen University

DOI:

https://doi.org/10.1609/aaai.v37i1.25172

Keywords:

CV: Language and Vision, CV: Representation Learning for Vision

Abstract

Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their success relies heavily on the scale and quality of web-crawled data, which naturally contain much incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean the data or generate pseudo-targets as auxiliary signals to reduce the impact of noise, but neither explicitly tackles the incorrectness and incompleteness challenges at the same time. In this paper, to automatically mitigate the impact of noise by mining solely over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) that stabilizes pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, and then adopts noise-adaptive regularization to harmonize the cross-modal alignments to varying degrees. Second, in the noise-completion scheme, to enrich the missing object information in the text, NLIP injects a concept-conditioned cross-modal decoder that produces semantically consistent synthetic captions to complete noisy ones, using the visual concepts (i.e., object names) retrieved for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, NLIP alleviates the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show that NLIP, trained on only 26M data, yields significant improvements over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets (e.g., +8.6% average accuracy over CLIP), MSCOCO image captioning (e.g., +1.9 CIDEr over BLIP trained with 129M data) and zero-shot image-text retrieval tasks.
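To make the noise-harmonization idea concrete, the PyTorch-style sketch below weights a CLIP-style image-text contrastive loss by an estimated per-pair noise probability, so that likely-noisy pairs contribute less to the alignment objective. This is a minimal illustration under assumptions, not the authors' released implementation: the function name, the (1 - noise probability) weighting rule, and the temperature value are placeholders, and how noise_prob is estimated (e.g., from per-sample loss statistics reflecting the memorization effect) is left abstract.

import torch
import torch.nn.functional as F

def noise_adaptive_contrastive_loss(image_emb, text_emb, noise_prob, temperature=0.07):
    """Illustrative sketch of noise-adaptive regularization (assumed form).
    image_emb, text_emb: (B, D) L2-normalized embeddings of paired images/texts.
    noise_prob: (B,) estimated probability that each image-text pair is noisy.
    """
    # Pairwise similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Per-pair cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair_loss = 0.5 * (loss_i2t + loss_t2i)

    # Harmonize alignment strength: pairs deemed likely noisy are down-weighted.
    weights = 1.0 - noise_prob
    return (weights * per_pair_loss).sum() / weights.sum().clamp(min=1e-6)

# Usage with random features, just to show the expected shapes.
B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
p_noise = torch.rand(B)  # placeholder noise estimates
loss = noise_adaptive_contrastive_loss(img, txt, p_noise)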

Published

2023-06-26

How to Cite

Huang, R., Long, Y., Han, J., Xu, H., Liang, X., Xu, C., & Liang, X. (2023). NLIP: Noise-Robust Language-Image Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 926-934. https://doi.org/10.1609/aaai.v37i1.25172

Section

AAAI Technical Track on Computer Vision I