Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval

Authors

  • Yu Liu, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Guihe Qin, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Haipeng Chen, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
  • Zhiyong Cheng, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
  • Xun Yang, University of Science and Technology of China, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v38i12.29314

Keywords:

ML: Multimodal Learning, ML: Causal Learning

Abstract

Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians given a textual query. Mainstream approaches primarily leverage pretrained deep neural networks to map the visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlations found in training data, rather than the intrinsic causal correlations. As a result, they often struggle to retrieve accurately under environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. To address this, we take a first step toward examining TPR from a causal view. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with the text description) and non-causal factors (retrieval-irrelevant, e.g., background), and that only the former can lead to reliable retrieval judgments. Our goal is to extract robust, text-critical visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal/non-causal factors are unobserved, so we posit that ideal causal factors capable of simulating causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors; and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on these principles, we propose an Invariant Representation Learning method for TPR (IRLT) that enforces the visual representations to satisfy the two aforementioned critical properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization.
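To make the two principles concrete, below is a minimal, hypothetical sketch, not the paper's actual IRLT objective. It assumes the image embedding is split into causal factors z_c and non-causal factors z_n, penalizes their cross-covariance as a proxy for Independence, and uses a standard InfoNCE image-text contrastive loss as a stand-in for Sufficiency. All names, dimensions, and loss weights are illustrative assumptions.

import torch
import torch.nn.functional as F

def independence_loss(z_c, z_n):
    """Penalize statistical dependence between causal and non-causal
    factors via the squared Frobenius norm of their cross-covariance."""
    z_c = z_c - z_c.mean(dim=0, keepdim=True)
    z_n = z_n - z_n.mean(dim=0, keepdim=True)
    cov = z_c.T @ z_n / (z_c.size(0) - 1)  # (d_c, d_n) cross-covariance
    return (cov ** 2).sum()

def sufficiency_loss(z_c, z_text, temperature=0.07):
    """Require the causal factors alone to match the paired text embedding
    (InfoNCE over the in-batch similarity matrix, used here as a proxy)."""
    z_c = F.normalize(z_c, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_c @ z_text.T / temperature  # (B, B) image-text similarities
    targets = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, targets)

# Combined objective over a batch of B image/text pairs (random tensors
# stand in for encoder outputs; the 0.1 weight is an arbitrary choice):
B, d = 32, 256
z_c, z_n, z_text = torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)
loss = sufficiency_loss(z_c, z_text) + 0.1 * independence_loss(z_c, z_n)

The cross-covariance penalty only enforces decorrelation rather than full statistical independence; a kernel-based criterion such as HSIC would be a stronger (and costlier) alternative.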

Published

2024-03-24

How to Cite

Liu, Y., Qin, G., Chen, H., Cheng, Z., & Yang, X. (2024). Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 14052-14060. https://doi.org/10.1609/aaai.v38i12.29314

Issue

Vol. 38 No. 12 (2024)

Section

AAAI Technical Track on Machine Learning III