Temporal Correlation Vision Transformer for Video Person Re-Identification
DOI:
https://doi.org/10.1609/aaai.v38i6.28424
Keywords:
CV: Image and Video Retrieval, CV: Video Understanding & Activity Analysis
Abstract
Video Person Re-Identification (Re-ID) is the task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
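The contrast the abstract draws between equal-weight pooling and weighted temporal aggregation can be illustrated with a minimal sketch. This is not the authors' LTA module: the linear scorer, its parameters `w` and `b`, and the function names are hypothetical, and the parameters are set by hand here rather than learned under classification-score guidance as the abstract describes. The sketch only shows how softmax weights over frames can down-weight a frame whose features look incomplete, compared with a plain average.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Baseline: average frame-level features, treating every frame equally."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_aggregate(
    frames: Sequence[Sequence[float]],
    w: Sequence[float],  # hypothetical scorer weights (hand-set, not learned here)
    b: float = 0.0,      # hypothetical scorer bias
) -> Tuple[List[float], List[float]]:
    """Score each frame with a linear scorer, normalize the scores over time
    with softmax, and return the weighted sum of frame features together
    with the per-frame weights."""
    scores = [sum(wi * xi for wi, xi in zip(w, f)) + b for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    video = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return video, weights


if __name__ == "__main__":
    # Toy clip: two frames showing the target, one frame dominated by an occluder.
    clip = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    video_feature, frame_weights = weighted_aggregate(clip, w=[2.0, -2.0])
    print("mean pooling:      ", mean_pool(clip))
    print("weighted aggregate:", video_feature)
    print("frame weights:     ", frame_weights)
```

In this toy clip the third frame receives a much smaller weight than the first two, so the aggregated feature stays closer to the unoccluded frames than the plain average does.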
Published
2024-03-24
How to Cite
Wu, P., Wang, L., Zhou, S., Hua, G., & Sun, C. (2024). Temporal Correlation Vision Transformer for Video Person Re-Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6083-6091. https://doi.org/10.1609/aaai.v38i6.28424
Section
AAAI Technical Track on Computer Vision V