KPDM: Key Phrase Dynamic Masking for Robust Text-to-Image Person Retrieval

Authors

  • Shaofeng You, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
  • Tianle Miao, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
  • Qihang Chen, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
  • Xin Li, Intelligent Technology Co., Ltd of Chinese Construction Third Engineering Bureau
  • Zhuo Cheng, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
  • Dapeng Luo, School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China

DOI:

https://doi.org/10.1609/aaai.v40i14.38199

Abstract

Text-to-image person re-identification (TIReID) aims to retrieve the most relevant pedestrian images from an image gallery based on natural language descriptions. Recent studies have achieved significant performance improvements by leveraging Masked Language Modeling (MLM) to align fine-grained information through local matching. However, on the text side, randomly masking tokens during feature extraction may disrupt the semantic relationships between local tokens, leading to feature misalignment; on the image side, redundant patches in pedestrian images hinder information interaction across modalities. Moreover, noisy image-text pairs further complicate learning, as the model may be misled into recognizing incorrect patterns. To address these issues, we propose a robust fine-grained local alignment framework based on Key Phrase Dynamic Mask (KPDM). First, we strengthen the semantic relationships between text tokens with an "adjective + noun" phrase-level masking strategy, and design a frequency-based masked language loss (FMLM) to supervise fine-grained semantic-level local alignment. Second, we integrate cross-layer importance estimation to highlight key pedestrian image representations while removing redundant image features. Third, we propose a trusted consensus partitioning mechanism that uses intra-identity image-text similarity distributions to identify noisy pairs, enhancing model robustness. Extensive experiments show that our method achieves 67.95% Rank-1 and 51.88% mAP on the RSTPReid dataset, exceeding the previous state of the art by 2.6% and 1%, respectively. Furthermore, KPDM achieves Rank-1 accuracies of 75.97% on the CUHK-PEDES dataset and 67.78% on the ICFG-PEDES dataset, outperforming earlier methods.
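
The "adjective + noun" phrase-level masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: the toy word lists stand in for a real part-of-speech tagger, and the names ADJECTIVES, NOUNS, find_adj_noun_phrases, mask_phrases, and mask_prob are all hypothetical. The sketch only shows the contrast with token-level random masking: an adjective and the noun it modifies are masked together rather than independently.

```python
import random

# Hypothetical toy lexicon; a real system would use a part-of-speech tagger.
ADJECTIVES = {"black", "white", "red", "blue", "gray", "long", "short"}
NOUNS = {"jacket", "pants", "hair", "shoes", "backpack", "shirt"}

MASK = "[MASK]"


def find_adj_noun_phrases(tokens):
    """Return (start, end) index pairs for adjacent adjective + noun tokens."""
    spans = []
    i = 0
    while i < len(tokens) - 1:
        if tokens[i].lower() in ADJECTIVES and tokens[i + 1].lower() in NOUNS:
            spans.append((i, i + 2))
            i += 2  # skip past the phrase so spans never overlap
        else:
            i += 1
    return spans


def mask_phrases(tokens, mask_prob=0.5, rng=None):
    """Mask each adjective + noun phrase as a unit with probability mask_prob.

    mask_prob is an assumed hyperparameter, not a value from the paper.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    masked_spans = []
    for start, end in find_adj_noun_phrases(tokens):
        if rng.random() < mask_prob:
            for j in range(start, end):
                masked[j] = MASK
            masked_spans.append((start, end))
    return masked, masked_spans


caption = "a woman with long hair wears a black jacket and blue pants".split()
masked, spans = mask_phrases(caption, mask_prob=1.0, rng=random.Random(0))
print(" ".join(masked))
```

With mask_prob=1.0 every detected phrase is masked, so both words of "long hair", "black jacket", and "blue pants" are replaced while the surrounding tokens are left intact.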

Published

2026-03-14

How to Cite

You, S., Miao, T., Chen, Q., Li, X., Cheng, Z., & Luo, D. (2026). KPDM: Key Phrase Dynamic Masking for Robust Text-to-Image Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12099–12107. https://doi.org/10.1609/aaai.v40i14.38199

Issue

Section

AAAI Technical Track on Computer Vision XI