Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification

Authors

  • Yajing Zhai, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
  • Yawen Zeng, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
  • Zhiyong Huang, National University of Singapore; NUS Research Institute in Chongqing
  • Zheng Qin, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
  • Xin Jin, Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
  • Da Cao, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

DOI:

https://doi.org/10.1609/aaai.v38i7.28524

Keywords:

CV: Image and Video Retrieval, CV: Language and Vision, CV: Multi-modal Vision, ML: Multimodal Learning

Abstract

Fine-grained attribute descriptions can significantly supplement the valuable semantic information of person images, which is vital to the success of the person re-identification (ReID) task. However, current ReID algorithms typically fail to effectively leverage the rich contextual information available, primarily due to their simplistic and coarse utilization of image attributes. Recent advances in artificial intelligence generated content have made it possible to automatically generate plentiful fine-grained attribute descriptions and make full use of them. This paper therefore explores the potential of using generated multiple person attributes as prompts in ReID tasks with off-the-shelf (large) models for more accurate retrieval results. To this end, we present a new framework called Multi-Prompts ReID (MP-ReID), based on prompt learning and language models, that fully exploits fine-grained attributes to assist the ReID task. Specifically, MP-ReID first learns to hallucinate diverse, informative, and promptable sentences for describing the query images. This procedure includes (i) explicit prompts stating which attributes a person has and (ii) implicit learnable prompts for adjusting/conditioning the criteria used for identity matching. Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models. Moreover, an alignment module is designed to fuse the multi-prompts (i.e., explicit and implicit ones) progressively and mitigate the cross-modal gap. Extensive experiments on existing attribute-involved ReID datasets, namely Market1501 and DukeMTMC-reID, demonstrate the effectiveness and rationality of the proposed MP-ReID solution.
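To make the two prompt types concrete, the following is a minimal sketch of how explicit attribute-prompt embeddings (text-encoder outputs for generated sentences) might be fused with implicit learnable prompt tokens and aligned with image features via a contrastive objective. This is not the authors' implementation: the module name `MultiPromptAlignment`, the cross-attention fusion, the dimensions, and the InfoNCE temperature are all illustrative assumptions in the spirit of CoOp/CLIP-style prompt learning.

```python
# Minimal sketch (not the paper's code): fusing explicit and implicit prompts,
# then aligning the fused prompt feature with image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPromptAlignment(nn.Module):
    def __init__(self, embed_dim=512, n_implicit=4):
        super().__init__()
        # Implicit prompts: learnable context vectors, optimized end to end.
        self.implicit_prompts = nn.Parameter(torch.randn(n_implicit, embed_dim) * 0.02)
        # Cross-attention fuses explicit (generated-text) and implicit prompts.
        self.fuse = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, explicit_feats, image_feats):
        """
        explicit_feats: (B, n_explicit, D) text-encoder embeddings of generated
                        attribute sentences (e.g., ChatGPT / VQA outputs).
        image_feats:    (B, D) global features from the image encoder.
        """
        B = explicit_feats.size(0)
        implicit = self.implicit_prompts.unsqueeze(0).expand(B, -1, -1)
        # Implicit prompts attend to explicit ones, conditioning the matching criteria.
        fused, _ = self.fuse(query=implicit, key=explicit_feats, value=explicit_feats)
        prompt_feat = self.proj(fused.mean(dim=1))  # (B, D)
        # Symmetric InfoNCE-style alignment to mitigate the cross-modal gap.
        t = F.normalize(prompt_feat, dim=-1)
        v = F.normalize(image_feats, dim=-1)
        logits = t @ v.t() / 0.07  # assumed temperature of 0.07
        targets = torch.arange(B, device=logits.device)
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return prompt_feat, loss
```

The sketch only illustrates the interplay of the two prompt types and an alignment loss; the paper's actual fusion is described as progressive, and its encoders and training objectives may differ.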

Published

2024-03-24

How to Cite

Zhai, Y., Zeng, Y., Huang, Z., Qin, Z., Jin, X., & Cao, D. (2024). Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6979-6987. https://doi.org/10.1609/aaai.v38i7.28524

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI