Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking

Kun Hu; Wenjing Yang; Wanrong Huang; Xianchen Zhou; Mingyu Cao; Jing Ren; Huibin Tan

doi:10.1609/aaai.v38i11.29145

Authors

Kun Hu, College of Computer Science and Technology, National University of Defense Technology
Wenjing Yang, College of Computer Science and Technology, National University of Defense Technology
Wanrong Huang, College of Computer Science and Technology, National University of Defense Technology
Xianchen Zhou, College of Sciences, National University of Defense Technology
Mingyu Cao, College of Computer Science and Technology, National University of Defense Technology
Jing Ren, College of Computer Science and Technology, National University of Defense Technology
Huibin Tan, College of Computer Science and Technology, National University of Defense Technology

DOI:

https://doi.org/10.1609/aaai.v38i11.29145

Keywords:

ML: Deep Learning Algorithms, CV: Other Foundations of Computer Vision, CV: Representation Learning for Vision, CV: Learning & Optimization for CV, CV: Applications

Abstract

Regarded as a template-matching task for a long time, visual object tracking has witnessed significant progress in space-wise exploration. However, since tracking is performed on videos with substantial time-wise information, it is important to simultaneously mine the temporal contexts which have not yet been deeply explored. Previous supervised works mostly consider template reform as the breakthrough point, but they are often limited by additional computational burdens or the quality of chosen templates. To address this issue, we propose a Space-Time Consistent Transformer Tracker (STCFormer), which uses a sequential fusion framework with multi-granularity consistency constraints to learn spatiotemporal context information. We design a sequential fusion framework that recombines template and search images based on tracking results from chronological frames, fusing updated tracking states in training. To further overcome the over-reliance on the fixed template without increasing computational complexity, we design three space-time consistent constraints: Label Consistency Loss (LCL) for label-level consistency, Attention Consistency Loss (ACL) for patch-level ROI consistency, and Semantic Consistency Loss (SCL) for feature-level semantic consistency. Specifically, in ACL and SCL, the label information is used to constrain the attention and feature consistency of the target and the background, respectively, to avoid mutual interference. Extensive experiments have shown that our STCFormer outperforms many of the best-performing trackers on several popular benchmarks.
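
The abstract states that ACL and SCL use label information to constrain the target and the background separately, but it does not give the loss formulas. The sketch below is only an illustration of that idea under stated assumptions, not the authors' implementation: the helper name `masked_consistency`, the use of a mean squared difference between two consecutive frames, and the flat per-patch score lists are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def masked_consistency(
    prev: List[float],
    curr: List[float],
    target_mask: List[bool],
) -> Tuple[float, float]:
    """Return (target_term, background_term) for two per-patch score lists.

    Hypothetical helper (not from the paper): the mean squared difference
    between consecutive frames, computed separately over patches labelled
    as target and patches labelled as background.
    """
    if not (len(prev) == len(curr) == len(target_mask)):
        raise ValueError("all inputs must have the same number of patches")

    def mean_sq_diff(select_target: bool) -> float:
        diffs = [
            (p - c) ** 2
            for p, c, m in zip(prev, curr, target_mask)
            if m == select_target
        ]
        # An empty region contributes nothing to the term.
        return sum(diffs) / len(diffs) if diffs else 0.0

    return mean_sq_diff(True), mean_sq_diff(False)


# Toy example: four patches, the first two labelled as target.
prev_attn = [1.0, 0.0, 0.5, 0.5]
curr_attn = [0.0, 0.0, 0.5, 0.5]
mask = [True, True, False, False]
target_term, background_term = masked_consistency(prev_attn, curr_attn, mask)
```

Keeping the two terms separate means a change in background scores cannot offset or mask a change in target scores, which is one plausible reading of "to avoid mutual interference". How the terms are weighted and combined is left open here, since the abstract does not specify it.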
