Temporal Correlation Vision Transformer for Video Person Re-Identification

Pengfei Wu; Le Wang; Sanping Zhou; Gang Hua; Changyin Sun

doi:10.1609/aaai.v38i6.28424

Authors

Pengfei Wu, Xi'an Jiaotong University
Le Wang, Xi'an Jiaotong University
Sanping Zhou, Xi'an Jiaotong University
Gang Hua, Wormpex AI Research
Changyin Sun, Anhui University

DOI:

https://doi.org/10.1609/aaai.v38i6.28424

Keywords:

CV: Image and Video Retrieval, CV: Video Understanding & Activity Analysis

Abstract

Video Person Re-Identification (Re-ID) is the task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
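
The contrast the abstract draws between equal-weight pooling and weighted aggregation of frame-level features can be illustrated with a small sketch. This is not the paper's LTA module: the abstract does not say how the weights are computed, so the softmax over per-frame scores, the function names, and the toy feature values below are all assumptions made purely for illustration.

```python
import math


def mean_pool(frames):
    """Average pooling: every frame contributes equally."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_aggregate(frames, scores):
    """Weighted sum of frame features, with weights = softmax(scores).

    Assumption: one scalar score per frame; the real LTA weighting
    scheme is not described in the abstract.
    """
    if len(frames) != len(scores):
        raise ValueError("need one score per frame")
    weights = softmax(scores)
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*frames)]


# Three hypothetical 2-D frame features; the third stands in for an occluded frame.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical per-frame scores (lower = less complete view of the target).
scores = [2.0, 2.0, -2.0]

pooled = mean_pool(frames)
weighted = weighted_aggregate(frames, scores)
```

In this toy setup the low-scoring frame contributes far less to `weighted` than to `pooled`, and when all scores are equal the two functions return the same vector.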
