Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval

Authors

  • Yu Liu, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Guihe Qin, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Haipeng Chen, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Zhiyong Cheng, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
  • Xun Yang, University of Science and Technology of China, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v38i12.29314

Keywords:

ML: Multimodal Learning, ML: Causal Learning

Abstract

Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians given a textual query. Mainstream approaches primarily leverage pretrained deep neural networks to map the visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlations found in training data, rather than the intrinsic causal correlations. As a result, they often struggle to retrieve accurately under environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. To address this, we take a first step toward examining TPR from a causal view. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with the text description) and non-causal factors (retrieval-irrelevant, e.g., background), and that only the former can lead to reliable retrieval judgments. Our goal is to extract robust, text-critical visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal/non-causal factors are unobserved, so we posit that ideal causal factors capable of simulating causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors; and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on these principles, we propose an Invariant Representation Learning method for TPR (IRLT) that enforces the visual representations to satisfy the two aforementioned critical properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization.
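To make the two principles concrete, below is a minimal, hypothetical sketch, not the paper's actual IRLT objective. It assumes the image embedding is split into causal factors z_c and non-causal factors z_n, penalizes their cross-covariance as a proxy for Independence, and uses a standard InfoNCE image-text contrastive loss as a stand-in for Sufficiency. All names, dimensions, and loss weights are illustrative assumptions.

import torch
import torch.nn.functional as F

def independence_loss(z_c, z_n):
    """Penalize statistical dependence between causal and non-causal
    factors via the squared Frobenius norm of their cross-covariance."""
    z_c = z_c - z_c.mean(dim=0, keepdim=True)
    z_n = z_n - z_n.mean(dim=0, keepdim=True)
    cov = z_c.T @ z_n / (z_c.size(0) - 1)  # (d_c, d_n) cross-covariance
    return (cov ** 2).sum()

def sufficiency_loss(z_c, z_text, temperature=0.07):
    """Require the causal factors alone to match the paired text embedding
    (InfoNCE over the in-batch similarity matrix, used here as a proxy)."""
    z_c = F.normalize(z_c, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_c @ z_text.T / temperature  # (B, B) image-text similarities
    targets = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, targets)

# Combined objective over a batch of B image/text pairs (random tensors
# stand in for encoder outputs; the 0.1 weight is an arbitrary choice):
B, d = 32, 256
z_c, z_n, z_text = torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)
loss = sufficiency_loss(z_c, z_text) + 0.1 * independence_loss(z_c, z_n)

The cross-covariance penalty only enforces decorrelation rather than full statistical independence; a kernel-based criterion such as HSIC would be a stronger (and costlier) alternative.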

Published

2024-03-24

How to Cite

Liu, Y., Qin, G., Chen, H., Cheng, Z., & Yang, X. (2024). Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 14052-14060. https://doi.org/10.1609/aaai.v38i12.29314

Issue

Vol. 38 No. 12 (2024)

Section

AAAI Technical Track on Machine Learning III