KPDM: Key Phrase Dynamic Masking for Robust Text-to-Image Person Retrieval

Shaofeng You; Tianle Miao; Qihang Chen; Xin Li; Zhuo Cheng; Dapeng Luo

doi:10.1609/aaai.v40i14.38199

Authors

Shaofeng You, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
Tianle Miao, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
Qihang Chen, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
Xin Li, Intelligent Technology Co., Ltd of Chinese Construction Third Engineering Bureau
Zhuo Cheng, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
Dapeng Luo, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China

DOI:

https://doi.org/10.1609/aaai.v40i14.38199

Abstract

Text-to-image person re-identification (TIReID) aims to retrieve the most relevant pedestrian images from an image gallery based on natural language descriptions. Recent studies have achieved significant performance improvements by leveraging Masked Language Modeling (MLM) to align fine-grained information through local matching. However, during text feature extraction, randomly masking text tokens may disrupt the semantic relationships between local tokens, leading to feature misalignment. On the image side, redundant patches in pedestrian images hinder information interaction across modalities. Moreover, noisy image-text pairs further complicate the learning process, as the model may be misled into recognizing incorrect patterns. To address these issues, we propose a robust fine-grained local alignment framework based on Key Phrase Dynamic Masking (KPDM). First, we strengthen the semantic relationships between text tokens by implementing an "adjective + noun" phrase-level masking strategy, and design a frequency-based masked language loss (FMLM) to supervise fine-grained semantic-level local alignment. Second, we integrate cross-layer importance estimation to highlight key pedestrian image representations while removing redundant image features. Third, we propose a trusted consensus partitioning mechanism that uses intra-identity image-text similarity distributions to identify noisy pairs, enhancing model robustness. Extensive experiments show that our method achieves 67.95% Rank-1 and 51.88% mAP on the RSTPReid dataset, exceeding the previous state of the art by 2.6% and 1%, respectively. Furthermore, KPDM achieves Rank-1 accuracies of 75.97% on the CUHK-PEDES dataset and 67.78% on the ICFG-PEDES dataset, outperforming earlier methods.
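
To make the "adjective + noun" phrase-level masking idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the caption has already been tokenized and tagged with coarse part-of-speech labels (the tag names `ADJ` and `NOUN`, the `[MASK]` token, the helper names `find_adj_noun_spans` and `mask_phrases`, and the `mask_prob` parameter are all hypothetical choices made for this example). The abstract does not specify how phrases are selected or how the masking is made dynamic, so the per-phrase sampling here is only an assumption.

```python
import random

MASK = "[MASK]"  # assumed mask token; the paper's tokenizer may use a different one


def find_adj_noun_spans(tags):
    """Return (start, end) spans where an ADJ is immediately followed by a NOUN."""
    spans = []
    i = 0
    while i < len(tags) - 1:
        if tags[i] == "ADJ" and tags[i + 1] == "NOUN":
            spans.append((i, i + 2))
            i += 2  # skip past the phrase so spans never overlap
        else:
            i += 1
    return spans


def mask_phrases(tokens, tags, mask_prob=0.5, rng=None):
    """Mask whole adjective+noun phrases instead of independent random tokens.

    Each phrase is masked as a unit with probability mask_prob (an assumption).
    Returns the masked tokens and per-position labels (None where unmasked).
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in find_adj_noun_spans(tags):
        if rng.random() < mask_prob:
            for j in range(start, end):
                labels[j] = tokens[j]  # original token is the prediction target
                masked[j] = MASK
    return masked, labels


# Hand-tagged example caption; a real pipeline would use a POS tagger.
tokens = ["a", "woman", "wearing", "a", "red", "jacket", "and", "black", "pants"]
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "CCONJ", "ADJ", "NOUN"]

masked, labels = mask_phrases(tokens, tags, mask_prob=1.0)
print(masked)
```

With `mask_prob=1.0`, both "red jacket" and "black pants" are masked as units, so the attribute and the object it modifies are hidden together rather than split by random token-level masking.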
