TokenMatcher: Diverse Tokens Matching for Unsupervised Visible-Infrared Person Re-Identification

Authors

  • Xiao Wang — School of Computer Science and Technology, Wuhan University of Science and Technology; School of Computer Science, Wuhan University
  • Lekai Liu — School of Computer Science and Technology, Wuhan University of Science and Technology; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China
  • Bin Yang — School of Computer Science, Wuhan University
  • Mang Ye — School of Computer Science, Wuhan University
  • Zheng Wang — School of Computer Science, Wuhan University
  • Xin Xu — School of Computer Science and Technology, Wuhan University of Science and Technology; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China

DOI:

https://doi.org/10.1609/aaai.v39i8.32855

Abstract

Unsupervised visible-infrared person re-identification (US-VI-ReID) seeks to match infrared and visible images of the same individual without any annotations. Current methods typically derive cross-modal correspondences through a single global feature matching process to generate pseudo labels and learn modality-invariant features. However, this matching approach is hindered by both intra-modality and inter-modality discrepancies, which result in imprecise measurements. As a consequence, clustering individuals with a single global feature is often incomplete and unreliable, leading to suboptimal cross-modal clustering performance. To address these challenges and extract cross-modality discriminative identity information, we propose TokenMatcher, which comprises three key components: Diverse Tokens Matching (DTM), Diverse Tokens Neighbor Learning (DTNL), and the Homogeneous Fusion (HF) module. DTM utilizes multiple class tokens within the visual transformer framework to capture diverse embedding representations, thereby facilitating the integration of fine-grained information essential for reliable cross-modality correspondences. DTNL enhances intra-modality and inter-modality consistency among diverse tokens by refining neighborhood sets with insights from neighboring tokens and camera information, promoting robust neighborhood learning and fostering discriminative identity information. Additionally, the HF module consolidates clusters of the same identity while effectively separating those of different identities. Extensive experiments on the publicly available SYSU-MM01 and RegDB datasets demonstrate the efficacy of the proposed method.
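The core architectural idea behind DTM — prepending several learnable class tokens to a vision transformer so each token captures a distinct embedding view of the same person image — can be sketched as follows. This is a minimal illustration only, not the authors' implementation; all module names, dimensions, and the two-layer encoder are hypothetical choices for the sketch.

```python
import torch
import torch.nn as nn

class MultiTokenEncoder(nn.Module):
    """Sketch of a ViT-style encoder with K learnable class tokens,
    each producing its own embedding of the input (a simplified,
    hypothetical stand-in for the paper's DTM component)."""

    def __init__(self, num_cls_tokens=4, dim=256, depth=2,
                 heads=4, num_patches=16):
        super().__init__()
        # K class tokens instead of the usual single [CLS] token
        self.cls_tokens = nn.Parameter(torch.randn(1, num_cls_tokens, dim) * 0.02)
        # learned positional embedding over tokens + patches
        self.pos = nn.Parameter(torch.randn(1, num_cls_tokens + num_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_cls_tokens = num_cls_tokens

    def forward(self, patches):
        # patches: (B, num_patches, dim) pre-embedded patch features
        b = patches.size(0)
        tokens = self.cls_tokens.expand(b, -1, -1)
        x = torch.cat([tokens, patches], dim=1) + self.pos
        x = self.encoder(x)
        # return one embedding per class token: (B, K, dim);
        # these diverse embeddings would then feed cross-modality matching
        return x[:, :self.num_cls_tokens]
```

In such a design, the K per-token embeddings of a visible image and an infrared image can each be compared, so a cross-modal correspondence is supported by several fine-grained views rather than a single global feature.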

Published

2025-04-11

How to Cite

Wang, X., Liu, L., Yang, B., Ye, M., Wang, Z., & Xu, X. (2025). TokenMatcher: Diverse Tokens Matching for Unsupervised Visible-Infrared Person Re-Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 7934-7942. https://doi.org/10.1609/aaai.v39i8.32855

Section

AAAI Technical Track on Computer Vision VII