TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval

Shuai Lyu; Zijing Tian; Zhonghong Ou; Yifan Zhu; Xiao Zhang; Qiankun Ha; Haoran Luo; Meina Song

doi:10.1609/aaai.v39i18.34121

Authors

Shuai Lyu, School of Computer Science, Beijing University of Posts and Telecommunications, China
Zijing Tian, School of Science, Beijing University of Posts and Telecommunications, China
Zhonghong Ou, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Yifan Zhu, School of Computer Science, Beijing University of Posts and Telecommunications, China
Xiao Zhang, School of Computer Science, Beijing University of Posts and Telecommunications, China
Qiankun Ha, School of Computer Science, Beijing University of Posts and Telecommunications, China
Haoran Luo, School of Computer Science, Beijing University of Posts and Telecommunications, China
Meina Song, School of Computer Science, Beijing University of Posts and Telecommunications, China

DOI:

https://doi.org/10.1609/aaai.v39i18.34121

Abstract

Cross-modal retrieval matches data across different modalities by semantic relevance. Existing approaches implicitly assume that data pairs are well aligned and ignore widespread annotation noise, i.e., noisy correspondence (NC), which inevitably degrades performance. Some methods employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, but the differences between these networks stem primarily from random initialization. The models therefore become increasingly homogeneous as training proceeds, and the additional information this paradigm provides is severely limited. To resolve this problem, we introduce Tripartite Learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model's noisy-label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even as the noise ratio increases, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.
