Noisy Correspondence Learning with Modality Gap Direction Correction
DOI:
https://doi.org/10.1609/aaai.v40i12.37984
Abstract
Cross-modal retrieval is crucial for discovering latent correspondences across different modalities. However, existing methods typically assume that training data are well aligned, an unrealistic assumption since real-world datasets inevitably contain noisy correspondences. Many current approaches attempt to handle this noise with strategies borrowed from single-modal classification, such as the small-loss trick, to identify clean training pairs. Our experiments reveal that such small-loss-based strategies are less effective for multi-modal tasks because of the inherent modality gap. Through comprehensive analysis, we observe that the deviation directions between paired image-caption features, termed Sample-level Alignment Drift (SAD), are compact and data-dependent. Leveraging this discovery, we introduce the Modality Gap Corrected Similarity (MGCS) framework, which measures the semantic distances of cross-modal samples more accurately by dynamically compensating for misalignments. Within MGCS, noisy data can be separated more reliably, providing correct supervision during cross-modal matching model training. Extensive experiments on three widely used noisy correspondence benchmarks demonstrate that MGCS significantly surpasses current state-of-the-art methods.
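The abstract's core idea, that paired image-caption features are separated by a compact, shared gap direction, and that removing this component yields a better similarity measure, can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the synthetic embeddings, the use of the mean pairwise deviation as the gap-direction estimate, and the projection-based correction are simple proxies, not the paper's actual MGCS estimator, which is not specified on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical toy data: image features on the unit sphere, and caption
# features that are the same semantics shifted by a shared "modality gap"
# direction plus small noise (mimicking the gap the paper describes).
d, n = 64, 200
gap = l2_normalize(rng.normal(size=d))                  # assumed gap direction
img = l2_normalize(rng.normal(size=(n, d)))
txt = l2_normalize(img + 0.8 * gap + 0.05 * rng.normal(size=(n, d)))

# Estimate the gap direction as the mean deviation between paired
# features -- a crude stand-in for the paper's Sample-level Alignment Drift.
est_gap = l2_normalize((txt - img).mean(axis=0))

# Plain cosine similarity of matched pairs.
plain = (img * txt).sum(axis=1)

# Gap-corrected similarity: project out the estimated gap component from
# both modalities, renormalize, then compare.
txt_corr = l2_normalize(txt - (txt @ est_gap)[:, None] * est_gap)
img_corr = l2_normalize(img - (img @ est_gap)[:, None] * est_gap)
corrected = (img_corr * txt_corr).sum(axis=1)

print(f"mean cosine (raw):       {plain.mean():.3f}")
print(f"mean cosine (corrected): {corrected.mean():.3f}")
```

In this synthetic setup, the corrected similarity of true pairs is noticeably higher than the raw cosine, because the shared gap component was depressing every matched-pair score; this is the mechanism by which a gap-corrected similarity can make small-loss-style clean/noisy separation more reliable.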
Published
2026-03-14
How to Cite
Wang, W., Gu, Z., & Yang, E. (2026). Noisy Correspondence Learning with Modality Gap Direction Correction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10163–10171. https://doi.org/10.1609/aaai.v40i12.37984
Issue
Section
AAAI Technical Track on Computer Vision IX