Exploring the Potential of Large Vision-Language Models for Unsupervised Text-Based Person Retrieval

Authors

  • Zongyi Li Huazhong University of Science and Technology
  • Li Jianbo Huazhong University of Science and Technology
  • Yuxuan Shi National Engineering Research Center of Educational Big Data and the Faculty of Artificial Intelligence in Education, Central China Normal University
  • Jiazhong Chen Huazhong University of Science and Technology
  • Shijuan Huang Huazhong University of Science and Technology
  • Linnan Tu Huazhong University of Science and Technology
  • Fei Shen Nanjing University of Science and Technology
  • Hefei Ling Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v39i5.32543

Abstract

The aim of text-based person retrieval is to identify pedestrians using natural language descriptions within a large-scale image gallery. Traditional methods rely heavily on manually annotated image-text pairs, which are resource-intensive to obtain. With the emergence of Large Vision-Language Models (LVLMs), contemporary models understand images well enough to generate highly accurate captions. Therefore, this paper explores the potential of employing Large Vision-Language Models for unsupervised text-based pedestrian image retrieval and proposes a Multi-grained Uncertainty Modeling and Alignment framework (MUMA). Initially, multiple Large Vision-Language Models are employed to generate diverse and hierarchically structured pedestrian descriptions across different styles and granularities. However, the generated captions inevitably introduce noise. To address this issue, an uncertainty-guided sample filtration module is proposed to estimate and filter out unreliable image-text pairs. Additionally, to simulate the diversity of styles and granularities in captions, a multi-grained uncertainty modeling approach is applied to model the distributions of captions, with each caption represented as a multivariate Gaussian distribution. Finally, a multi-level consistency distillation loss is employed to integrate and align the multi-grained captions, aiming to transfer knowledge across different granularities. Experimental evaluations conducted on three widely-used datasets demonstrate the significant advancements achieved by our approach.
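To make the abstract's probabilistic modeling concrete, the sketch below illustrates one plausible reading of the ideas (not the authors' actual implementation): each caption is represented as a diagonal multivariate Gaussian, a stochastic embedding is drawn via the reparameterization trick, the total predicted variance serves as a simple uncertainty score for filtering unreliable pairs, and a KL divergence between a fine-grained and a coarse-grained caption distribution stands in for a consistency term. All names and the specific formulas are illustrative assumptions.

```python
import numpy as np

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Hypothetical caption embeddings: each caption maps to (mean, variance).
mu_fine, var_fine = rng.normal(size=d), np.exp(rng.normal(size=d))
mu_coarse, var_coarse = rng.normal(size=d), np.exp(rng.normal(size=d))

# Reparameterization: draw a stochastic caption embedding z = mu + sigma * eps.
eps = rng.standard_normal(d)
z = mu_fine + np.sqrt(var_fine) * eps

# Uncertainty-guided filtration (one simple proxy): treat total variance as
# the uncertainty of an image-text pair and drop pairs above a threshold.
uncertainty = float(np.sum(var_fine))
keep_pair = uncertainty < 20.0  # threshold is an arbitrary illustrative value

# Consistency distillation (sketch): penalize divergence between the
# fine-grained and coarse-grained caption distributions.
consistency_loss = kl_diag_gaussian(mu_fine, var_fine, mu_coarse, var_coarse)
```

In a full pipeline, the means and variances would come from a learned text encoder head, and the consistency loss would be summed over granularity levels; the snippet only fixes the arithmetic of the distributional pieces.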

Published

2025-04-11

How to Cite

Li, Z., Jianbo, L., Shi, Y., Chen, J., Huang, S., Tu, L., Shen, F., & Ling, H. (2025). Exploring the Potential of Large Vision-Language Models for Unsupervised Text-Based Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5119-5127. https://doi.org/10.1609/aaai.v39i5.32543

Section

AAAI Technical Track on Computer Vision IV