Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Authors

  • Zhiwei Zhao, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Bin Liu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Yan Lu, Shanghai AI Laboratory
  • Qi Chu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Nenghai Yu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information

DOI:

https://doi.org/10.1609/aaai.v38i7.28585

Keywords:

CV: Language and Vision, CV: Image and Video Retrieval

Abstract

Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, which limits the expression of image-text relationships and their semantic alignment. To address this problem, in this paper we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of each distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides richer image-text semantic relationships. We then present a bi-directional cross-modal circle loss to align the probabilistic features between image and text more effectively in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling by focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art approaches.
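The sketch below is not the authors' released code; it is a minimal illustration, under stated assumptions, of the multi-granularity uncertainty modeling described in the abstract: deterministic per-modality embeddings are treated as Gaussian distributions whose variance blends batch-level and identity-level feature statistics, and a reparameterized sample serves as feature augmentation before cross-modal alignment. The class name `UncertainEmbedding` and the blending weight `alpha` are illustrative assumptions, not names from the paper.

```python
# Minimal sketch of multi-granularity uncertainty modeling (assumed formulation).
import torch
import torch.nn as nn


class UncertainEmbedding(nn.Module):
    """Wraps deterministic per-modality features into sampled Gaussian features."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed weighting between batch- and identity-level variance

    def forward(self, feats: torch.Tensor, pids: torch.Tensor) -> torch.Tensor:
        # feats: (B, D) deterministic embeddings from one modality (image or text)
        # pids:  (B,)   person identity labels
        batch_var = feats.var(dim=0, unbiased=False, keepdim=True)  # (1, D) batch-level variance

        # Identity-level variance: variance among features sharing the same identity.
        id_var = torch.zeros_like(feats)
        for pid in pids.unique():
            mask = pids == pid
            if mask.sum() > 1:
                id_var[mask] = feats[mask].var(dim=0, unbiased=False, keepdim=True)

        # Multi-granularity uncertainty: blend the two variance estimates.
        sigma2 = self.alpha * batch_var + (1.0 - self.alpha) * id_var  # (B, D)

        # Reparameterized sample ~ N(feats, sigma2); acts as feature augmentation.
        eps = torch.randn_like(feats)
        return feats + eps * sigma2.clamp_min(1e-6).sqrt()


# Usage sketch: sample probabilistic features for both modalities before alignment.
if __name__ == "__main__":
    img_feats, txt_feats = torch.randn(8, 512), torch.randn(8, 512)
    pids = torch.randint(0, 4, (8,))
    sampler = UncertainEmbedding()
    img_aug, txt_aug = sampler(img_feats, pids), sampler(txt_feats, pids)
    print(img_aug.shape, txt_aug.shape)
```

In the paper, such probabilistic features would then be aligned with the proposed bi-directional cross-modal circle loss; that loss is not reproduced here.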

Published

2024-03-24

How to Cite

Zhao, Z., Liu, B., Lu, Y., Chu, Q., & Yu, N. (2024). Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7534-7542. https://doi.org/10.1609/aaai.v38i7.28585

Issue

Vol. 38 No. 7 (2024)
Section

AAAI Technical Track on Computer Vision VI