Noisy Correspondence Learning with Modality Gap Direction Correction
DOI:
https://doi.org/10.1609/aaai.v40i12.37984
Abstract
Cross-modal retrieval is crucial for discovering latent correspondences across different modalities. However, existing methods typically assume that training data are well aligned, an unrealistic assumption since real-world datasets inevitably contain noisy correspondences. Many current approaches attempt to handle this noise with strategies borrowed from single-modal classification, such as the small-loss trick, to identify clean training pairs. Our experiments reveal that such small-loss-based strategies are less effective for multi-modal tasks because of the inherent modality gap. Through comprehensive analysis, we observe that the deviation directions between paired image-caption features, termed Sample-level Alignment Drift (SAD), are compact and data-dependent. Leveraging this discovery, we introduce the Modality Gap Corrected Similarity (MGCS) framework, which measures the semantic distances of cross-modal samples more accurately by dynamically compensating for misalignments. Within MGCS, noisy data can be separated more reliably, providing correct supervision during cross-modal matching model training. Extensive experiments on three widely used noisy correspondence benchmarks demonstrate that MGCS significantly surpasses current state-of-the-art methods.
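The abstract's core idea, that paired image-caption features are separated by a compact, shared gap direction, and that removing this component yields a better similarity measure, can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the synthetic embeddings, the use of the mean pairwise deviation as the gap-direction estimate, and the projection-based correction are simple proxies, not the paper's actual MGCS estimator, which is not specified on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical toy data: image features on the unit sphere, and caption
# features that are the same semantics shifted by a shared "modality gap"
# direction plus small noise (mimicking the gap the paper describes).
d, n = 64, 200
gap = l2_normalize(rng.normal(size=d))                  # assumed gap direction
img = l2_normalize(rng.normal(size=(n, d)))
txt = l2_normalize(img + 0.8 * gap + 0.05 * rng.normal(size=(n, d)))

# Estimate the gap direction as the mean deviation between paired
# features -- a crude stand-in for the paper's Sample-level Alignment Drift.
est_gap = l2_normalize((txt - img).mean(axis=0))

# Plain cosine similarity of matched pairs.
plain = (img * txt).sum(axis=1)

# Gap-corrected similarity: project out the estimated gap component from
# both modalities, renormalize, then compare.
txt_corr = l2_normalize(txt - (txt @ est_gap)[:, None] * est_gap)
img_corr = l2_normalize(img - (img @ est_gap)[:, None] * est_gap)
corrected = (img_corr * txt_corr).sum(axis=1)

print(f"mean cosine (raw):       {plain.mean():.3f}")
print(f"mean cosine (corrected): {corrected.mean():.3f}")
```

In this synthetic setup, the corrected similarity of true pairs is noticeably higher than the raw cosine, because the shared gap component was depressing every matched-pair score; this is the mechanism by which a gap-corrected similarity can make small-loss-style clean/noisy separation more reliable.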
Published
2026-03-14
How to Cite
Wang, W., Gu, Z., & Yang, E. (2026). Noisy Correspondence Learning with Modality Gap Direction Correction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10163–10171. https://doi.org/10.1609/aaai.v40i12.37984
Issue
Section
AAAI Technical Track on Computer Vision IX