Frame-Guided Region-Aligned Representation for Video Person Re-Identification
DOI:
https://doi.org/10.1609/aaai.v34i07.6632Abstract
Pedestrians in videos are usually in a moving state, resulting in serious spatial misalignment like scale variations and pose changes, which makes the video-based person re-identification problem more challenging. To address the above issue, in this paper, we propose a Frame-Guided Region-Aligned model (FGRA) for discriminative representation learning in two steps in an end-to-end manner. Firstly, based on a frame-guided feature learning strategy and a non-parametric alignment module, a novel alignment mechanism is proposed to extract well-aligned region features. Secondly, in order to form a sequence representation, an effective feature aggregation strategy that utilizes temporal alignment score and spatial attention is adopted to fuse region features in the temporal and spatial dimensions, respectively. Experiments are conducted on benchmark datasets to demonstrate the effectiveness of the proposed method to solve the misalignment problem and the superiority of the proposed method to the existing video-based person re-identification methods.