Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking

Authors

  • Kun Hu, College of Computer Science and Technology, National University of Defense Technology
  • Wenjing Yang, College of Computer Science and Technology, National University of Defense Technology
  • Wanrong Huang, College of Computer Science and Technology, National University of Defense Technology
  • Xianchen Zhou, College of Sciences, National University of Defense Technology
  • Mingyu Cao, College of Computer Science and Technology, National University of Defense Technology
  • Jing Ren, College of Computer Science and Technology, National University of Defense Technology
  • Huibin Tan, College of Computer Science and Technology, National University of Defense Technology

DOI:

https://doi.org/10.1609/aaai.v38i11.29145

Keywords:

ML: Deep Learning Algorithms, CV: Other Foundations of Computer Vision, CV: Representation Learning for Vision, CV: Learning & Optimization for CV, CV: Applications

Abstract

Regarded as a template-matching task for a long time, visual object tracking has witnessed significant progress in space-wise exploration. However, since tracking is performed on videos with substantial time-wise information, it is important to simultaneously mine the temporal contexts which have not yet been deeply explored. Previous supervised works mostly consider template reform as the breakthrough point, but they are often limited by additional computational burdens or the quality of chosen templates. To address this issue, we propose a Space-Time Consistent Transformer Tracker (STCFormer), which uses a sequential fusion framework with multi-granularity consistency constraints to learn spatiotemporal context information. We design a sequential fusion framework that recombines template and search images based on tracking results from chronological frames, fusing updated tracking states in training. To further overcome the over-reliance on the fixed template without increasing computational complexity, we design three space-time consistent constraints: Label Consistency Loss (LCL) for label-level consistency, Attention Consistency Loss (ACL) for patch-level ROI consistency, and Semantic Consistency Loss (SCL) for feature-level semantic consistency. Specifically, in ACL and SCL, the label information is used to constrain the attention and feature consistency of the target and the background, respectively, to avoid mutual interference. Extensive experiments have shown that our STCFormer outperforms many of the best-performing trackers on several popular benchmarks.
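
As a rough illustration of the label-guided consistency idea described in the abstract, the sketch below compares per-patch scores (for example, attention weights) from two frames, handling target patches and background patches separately so that neither region influences the other's term. This is not the paper's formulation of ACL or SCL: the function names, the use of region means, and the squared-difference penalty are all assumptions made for illustration only.

```python
def masked_mean(values, mask):
    """Mean of the values whose mask entry is 1; None if the mask selects nothing."""
    selected = [v for v, m in zip(values, mask) if m]
    if not selected:
        return None
    return sum(selected) / len(selected)


def region_consistency(prev, prev_mask, curr, curr_mask):
    """Hypothetical consistency penalty between two frames.

    prev, curr: per-patch scores from an earlier and a later frame.
    prev_mask, curr_mask: 1 for patches inside the labeled target box,
    0 for background patches (derived from the box labels of each frame).

    The target term and the background term are computed separately and
    then summed; this separation is the only property taken from the
    abstract. The squared difference of region means is an assumption.
    """
    loss = 0.0
    for region in (1, 0):  # 1 = target, 0 = background
        prev_region = [1 if m == region else 0 for m in prev_mask]
        curr_region = [1 if m == region else 0 for m in curr_mask]
        a = masked_mean(prev, prev_region)
        b = masked_mean(curr, curr_region)
        if a is not None and b is not None:
            loss += (a - b) ** 2
    return loss


# The target moves from patch 1 to patch 0 between the two frames.
prev_scores, prev_labels = [0.1, 0.7, 0.2], [0, 1, 0]
curr_scores, curr_labels = [0.5, 0.3, 0.2], [1, 0, 0]
penalty = region_consistency(prev_scores, prev_labels, curr_scores, curr_labels)
```

In this toy example the penalty is nonzero because the score on the target drops between frames while the background scores rise; it would be zero if each region kept the same mean score even though the target changed position.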

Published

2024-03-24

How to Cite

Hu, K., Yang, W., Huang, W., Zhou, X., Cao, M., Ren, J., & Tan, H. (2024). Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12519-12527. https://doi.org/10.1609/aaai.v38i11.29145

Issue

Section

AAAI Technical Track on Machine Learning II