CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification
DOI:
https://doi.org/10.1609/aaai.v39i8.32962
Abstract
With the emergence of vision-language pre-trained models such as CLIP, textual prompts have recently been introduced into re-identification (Re-ID) tasks to obtain more robust multimodal information. However, most textual descriptions used in vehicle Re-ID contain only identity index words, without specific words describing vehicle view information, which makes them difficult to apply widely to vehicle Re-ID tasks with view variations. This motivates us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template, called view-aware context optimization (ViewCoOp), based on dynamic multi-view word embeddings, which captures the proportion and position encoding of each view within the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore inter-modal and intra-modal connections. Each sample is treated as a graph node, for which textual features are extracted based on ViewCoOp alongside the visual features of the image. Moreover, the inter-cluster and intra-cluster correlations in the bimodal clustering results are leveraged to determine the connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method uses supervisory information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method effectively obtains cross-modal description ability from multiple views.
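The abstract says that connectivity between graph node pairs is determined from the bimodal clustering results, but it does not give the exact rule. The following is a minimal sketch under a stated assumption: two samples are connected only when they fall in the same cluster in both the visual and the textual clustering. The function name `mutual_graph`, the agreement rule, and the toy labels are all hypothetical and are not taken from the paper.

```python
import numpy as np


def mutual_graph(visual_labels, text_labels):
    """Build a boolean adjacency matrix over samples (graph nodes).

    Assumed rule (not from the paper): nodes i and j are connected
    when they share a cluster in BOTH modalities.
    """
    v = np.asarray(visual_labels)
    t = np.asarray(text_labels)
    # Pairwise cluster agreement within each modality, via broadcasting.
    same_visual = v[:, None] == v[None, :]
    same_text = t[:, None] == t[None, :]
    # Keep only the pairs on which the two clusterings agree.
    adj = same_visual & same_text
    np.fill_diagonal(adj, False)  # drop self-loops
    return adj


# Toy pseudo-labels for five samples (hypothetical values).
visual_labels = [0, 0, 1, 1, 2]
text_labels = [0, 0, 0, 1, 2]
adj = mutual_graph(visual_labels, text_labels)
```

In this toy case samples 2 and 3 share a visual cluster but not a textual one, so the assumed rule leaves them unconnected. A real pipeline would obtain the two label vectors by clustering CLIP image features and prompt-based text features, and may use a softer criterion than strict agreement.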
Published
2025-04-11
How to Cite
Xu, J., Wang, Q., Xiong, X., Gai, D., Zhou, R., & Wang, D. (2025). CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8896–8904. https://doi.org/10.1609/aaai.v39i8.32962
Section
AAAI Technical Track on Computer Vision VII