Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Authors

  • Zhiwei Zhao, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Bin Liu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Yan Lu, Shanghai AI Laboratory
  • Qi Chu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information
  • Nenghai Yu, School of Cyber Science and Technology, University of Science and Technology of China; CAS Key Laboratory of Electromagnetic Space Information

DOI:

https://doi.org/10.1609/aaai.v38i7.28585

Keywords:

CV: Language and Vision, CV: Image and Video Retrieval

Abstract

Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, which limits the expression of image-text relationships and their semantic alignment. To address this problem, in this paper we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of each distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides richer image-text semantic relationships. We then present a bi-directional cross-modal circle loss to align the probabilistic features between image and text more effectively in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling by focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art approaches.
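The sketch below is not the authors' released code; it is a minimal illustration, under stated assumptions, of the multi-granularity uncertainty modeling described in the abstract: deterministic per-modality embeddings are treated as Gaussian distributions whose variance blends batch-level and identity-level feature statistics, and a reparameterized sample serves as feature augmentation before cross-modal alignment. The class name `UncertainEmbedding` and the blending weight `alpha` are illustrative assumptions, not names from the paper.

```python
# Minimal sketch of multi-granularity uncertainty modeling (assumed formulation).
import torch
import torch.nn as nn


class UncertainEmbedding(nn.Module):
    """Wraps deterministic per-modality features into sampled Gaussian features."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed weighting between batch- and identity-level variance

    def forward(self, feats: torch.Tensor, pids: torch.Tensor) -> torch.Tensor:
        # feats: (B, D) deterministic embeddings from one modality (image or text)
        # pids:  (B,)   person identity labels
        batch_var = feats.var(dim=0, unbiased=False, keepdim=True)  # (1, D) batch-level variance

        # Identity-level variance: variance among features sharing the same identity.
        id_var = torch.zeros_like(feats)
        for pid in pids.unique():
            mask = pids == pid
            if mask.sum() > 1:
                id_var[mask] = feats[mask].var(dim=0, unbiased=False, keepdim=True)

        # Multi-granularity uncertainty: blend the two variance estimates.
        sigma2 = self.alpha * batch_var + (1.0 - self.alpha) * id_var  # (B, D)

        # Reparameterized sample ~ N(feats, sigma2); acts as feature augmentation.
        eps = torch.randn_like(feats)
        return feats + eps * sigma2.clamp_min(1e-6).sqrt()


# Usage sketch: sample probabilistic features for both modalities before alignment.
if __name__ == "__main__":
    img_feats, txt_feats = torch.randn(8, 512), torch.randn(8, 512)
    pids = torch.randint(0, 4, (8,))
    sampler = UncertainEmbedding()
    img_aug, txt_aug = sampler(img_feats, pids), sampler(txt_feats, pids)
    print(img_aug.shape, txt_aug.shape)
```

In the paper, such probabilistic features would then be aligned with the proposed bi-directional cross-modal circle loss; that loss is not reproduced here.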

Published

2024-03-24

How to Cite

Zhao, Z., Liu, B., Lu, Y., Chu, Q., & Yu, N. (2024). Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7534-7542. https://doi.org/10.1609/aaai.v38i7.28585

Issue

Vol. 38 No. 7 (2024)
Section

AAAI Technical Track on Computer Vision VI