Temporal Correlation Vision Transformer for Video Person Re-Identification

Authors

  • Pengfei Wu Xi'an Jiaotong University
  • Le Wang Xi'an Jiaotong University
  • Sanping Zhou Xi'an Jiaotong University
  • Gang Hua Wormpex AI Research
  • Changyin Sun Anhui University

DOI:

https://doi.org/10.1609/aaai.v38i6.28424

Keywords:

CV: Image and Video Retrieval, CV: Video Understanding & Activity Analysis

Abstract

Video Person Re-Identification (Re-ID) is the task of retrieving persons across multi-camera surveillance systems. Despite progress in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further improvement. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module aggregates frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
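
The contrast the abstract draws between equal-weight pooling and learned frame weighting can be illustrated with a minimal sketch. This is not the paper's LTA module: it assumes a single hypothetical linear scoring layer (`w`, `b`) followed by a softmax over frames, and it does not model the classification-score guidance the abstract mentions. All function and parameter names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Baseline: average frame-level features, treating every frame equally."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-frame scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_aggregate(
    frames: Sequence[Sequence[float]],
    w: Sequence[float],
    b: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Score each frame with a linear layer, then take a softmax-weighted sum.

    `w` and `b` stand in for learned parameters (hypothetical; in practice
    they would be trained, which this sketch does not do).
    """
    # One scalar score per frame: dot(w, frame) + b.
    scores = [sum(wi * xi for wi, xi in zip(w, f)) + b for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Video-level feature: weighted sum of frame features per dimension.
    video = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return video, weights


if __name__ == "__main__":
    # Three toy 2-D frame features; the third frame differs from the others,
    # standing in for an occluded or incomplete frame.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    video, weights = weighted_aggregate(frames, w=[2.0, -2.0])
    print("mean pooling:     ", mean_pool(frames))
    print("weighted features:", video)
    print("frame weights:    ", weights)
```

With the toy scoring vector above, the third frame receives a much smaller weight than the first two, so it contributes less to the video-level feature than it would under mean pooling. Setting `w` to all zeros makes every frame score equal and recovers mean pooling exactly.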

Published

2024-03-24

How to Cite

Wu, P., Wang, L., Zhou, S., Hua, G., & Sun, C. (2024). Temporal Correlation Vision Transformer for Video Person Re-Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6083-6091. https://doi.org/10.1609/aaai.v38i6.28424

Issue

Section

AAAI Technical Track on Computer Vision V