Exploring the Potential of Large Vision-Language Models for Unsupervised Text-Based Person Retrieval

Authors

  • Zongyi Li Huazhong University of Science and Technology
  • Li Jianbo Huazhong University of Science and Technology
  • Yuxuan Shi National Engineering Research Center of Educational Big Data and the Faculty of Artificial Intelligence in Education, Central China Normal University
  • Jiazhong Chen Huazhong University of Science and Technology
  • Shijuan Huang Huazhong University of Science and Technology
  • Linnan Tu Huazhong University of Science and Technology
  • Fei Shen Nanjing University of Science and Technology
  • Hefei Ling Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v39i5.32543

Abstract

The aim of text-based person retrieval is to identify pedestrians using natural language descriptions within a large-scale image gallery. Traditional methods rely heavily on manually annotated image-text pairs, which are resource-intensive to obtain. With the emergence of Large Vision-Language Models (LVLMs), contemporary models understand images well enough to generate highly accurate captions. Therefore, this paper explores the potential of employing Large Vision-Language Models for unsupervised text-based pedestrian image retrieval and proposes a Multi-grained Uncertainty Modeling and Alignment framework (MUMA). Initially, multiple Large Vision-Language Models are employed to generate diverse and hierarchically structured pedestrian descriptions across different styles and granularities. However, the generated captions inevitably introduce noise. To address this issue, an uncertainty-guided sample filtration module is proposed to estimate and filter out unreliable image-text pairs. Additionally, to simulate the diversity of styles and granularities in captions, a multi-grained uncertainty modeling approach is applied to model the distributions of captions, with each caption represented as a multivariate Gaussian distribution. Finally, a multi-level consistency distillation loss is employed to integrate and align the multi-grained captions, aiming to transfer knowledge across different granularities. Experimental evaluations conducted on three widely-used datasets demonstrate the significant advancements achieved by our approach.
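To make the abstract's probabilistic modeling concrete, the sketch below illustrates one plausible reading of the ideas (not the authors' actual implementation): each caption is represented as a diagonal multivariate Gaussian, a stochastic embedding is drawn via the reparameterization trick, the total predicted variance serves as a simple uncertainty score for filtering unreliable pairs, and a KL divergence between a fine-grained and a coarse-grained caption distribution stands in for a consistency term. All names and the specific formulas are illustrative assumptions.

```python
import numpy as np

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Hypothetical caption embeddings: each caption maps to (mean, variance).
mu_fine, var_fine = rng.normal(size=d), np.exp(rng.normal(size=d))
mu_coarse, var_coarse = rng.normal(size=d), np.exp(rng.normal(size=d))

# Reparameterization: draw a stochastic caption embedding z = mu + sigma * eps.
eps = rng.standard_normal(d)
z = mu_fine + np.sqrt(var_fine) * eps

# Uncertainty-guided filtration (one simple proxy): treat total variance as
# the uncertainty of an image-text pair and drop pairs above a threshold.
uncertainty = float(np.sum(var_fine))
keep_pair = uncertainty < 20.0  # threshold is an arbitrary illustrative value

# Consistency distillation (sketch): penalize divergence between the
# fine-grained and coarse-grained caption distributions.
consistency_loss = kl_diag_gaussian(mu_fine, var_fine, mu_coarse, var_coarse)
```

In a full pipeline, the means and variances would come from a learned text encoder head, and the consistency loss would be summed over granularity levels; the snippet only fixes the arithmetic of the distributional pieces.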

Published

2025-04-11

How to Cite

Li, Z., Jianbo, L., Shi, Y., Chen, J., Huang, S., Tu, L., Shen, F., & Ling, H. (2025). Exploring the Potential of Large Vision-Language Models for Unsupervised Text-Based Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5119-5127. https://doi.org/10.1609/aaai.v39i5.32543

Section

AAAI Technical Track on Computer Vision IV