Joint Implicit and Explicit Language Learning for Pedestrian Attribute Recognition
DOI:
https://doi.org/10.1609/aaai.v40i15.38296
Abstract
Pedestrian attribute recognition (PAR) has received increasing attention due to its wide application in video surveillance and pedestrian analysis. Some text-enhanced methods tackle this task by converting attributes into language descriptions to facilitate interactive learning between attributes and visual images. However, these generic descriptions cannot uniquely describe different pedestrian images and therefore miss individual characteristics. In this paper, we propose a Joint Implicit and Explicit Language Guidance Enhancement Learning (JGEL) method, which converts each pedestrian image into a language description and uses dual language learning to learn enhanced attribute information effectively. Specifically, we first propose an Implicit Language Guidance Learning (ILGL) stream. It projects visual image features into the text embedding space to generate pseudo-word tokens, implicitly modeling image attributes and providing personalized descriptions. Moreover, we propose an Explicit Attribute Enhancement Learning (EAEL) stream that guides the pseudo-word tokens generated by ILGL to be explicitly aligned with pedestrian attributes, so that the tokens match the attribute concepts in the text embedding space. Extensive experiments show that JGEL has significant advantages in improving performance on PAR and on the challenging zero-shot PAR task.
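The idea of projecting an image feature into the text embedding space and comparing the resulting pseudo-word tokens with attribute embeddings can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the projection, the dimensions, or the alignment objective, so the single linear map, the cosine-similarity comparison, the random placeholder data, and the names `project_to_pseudo_tokens`, `cosine`, `d_img`, `d_txt`, `num_tokens`, and `num_attrs` are hypothetical and are not taken from JGEL itself.

```python
import math
import random


def project_to_pseudo_tokens(image_feat, weight, bias, num_tokens):
    """Linearly map one image feature (length d_img) to num_tokens pseudo-word
    tokens of length d_txt. weight has shape d_img x (num_tokens * d_txt).
    Assumption: a single linear layer stands in for the real projection."""
    out_dim = len(bias)
    flat = [
        sum(image_feat[i] * weight[i][j] for i in range(len(image_feat))) + bias[j]
        for j in range(out_dim)
    ]
    d_txt = out_dim // num_tokens
    # Split the flat output into num_tokens vectors in the text embedding space.
    return [flat[k * d_txt:(k + 1) * d_txt] for k in range(num_tokens)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical sizes and random placeholder data (not from the paper).
random.seed(0)
d_img, d_txt, num_tokens, num_attrs = 8, 4, 3, 5
image_feat = [random.gauss(0, 1) for _ in range(d_img)]
weight = [[random.gauss(0, 1) for _ in range(num_tokens * d_txt)] for _ in range(d_img)]
bias = [0.0] * (num_tokens * d_txt)
attr_embeds = [[random.gauss(0, 1) for _ in range(d_txt)] for _ in range(num_attrs)]

# Implicit step (sketch): image feature -> pseudo-word tokens.
pseudo_tokens = project_to_pseudo_tokens(image_feat, weight, bias, num_tokens)

# Explicit step (sketch): score each pseudo-word token against each attribute embedding.
similarity = [[cosine(t, a) for a in attr_embeds] for t in pseudo_tokens]

print(len(pseudo_tokens), len(pseudo_tokens[0]), len(similarity), len(similarity[0]))
# prints "3 4 3 5"
```

In a trained model, a loss would presumably push each token's similarity toward the attributes present in the image. That objective is omitted here because the abstract does not describe it.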
Published
2026-03-14
How to Cite
Zhang, Y., Tan, L., Lu, Y., Yan, Y., & Wang, H. (2026). Joint Implicit and Explicit Language Learning for Pedestrian Attribute Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12970–12978. https://doi.org/10.1609/aaai.v40i15.38296
Section
AAAI Technical Track on Computer Vision XII