ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Authors

  • Yaozong Zheng, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University
  • Bineng Zhong, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University
  • Qihua Liang, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University
  • Zhiyi Mo, Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University
  • Shengping Zhang, Harbin Institute of Technology
  • Xianxian Li, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University

DOI:

https://doi.org/10.1609/aaai.v38i7.28591

Keywords:

CV: Motion & Tracking

Abstract

Online contextual reasoning and association across consecutive video frames are critical for perceiving instances in visual tracking. However, most current top-performing trackers rely on sparse temporal relationships between reference and search frames in an offline mode. Consequently, they interact only within each image pair and establish limited temporal correlations. To alleviate this problem, we propose a simple, flexible, and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames through online token propagation. ODTrack accepts video sequences of arbitrary length to capture the spatio-temporal trajectory of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This design brings two benefits: 1) the purified token sequence serves as a prompt for inference on the next video frame, so that past information is leveraged to guide future inference; 2) iterative propagation of the token sequence avoids complex online update strategies, yielding a more efficient model representation and computation. ODTrack achieves new state-of-the-art performance on seven benchmarks while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.
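To make the token-propagation idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a transformer encoder jointly attends over template tokens, search-region tokens, and a propagated temporal token, and the refined token is carried forward as a prompt for the next frame. All names and hyperparameters here (TemporalTokenTracker, num_tokens, depth, etc.) are illustrative assumptions.

```python
# Illustrative sketch of online dense temporal token propagation
# (hypothetical module, not the official ODTrack code).
import torch
import torch.nn as nn


class TemporalTokenTracker(nn.Module):  # hypothetical name
    def __init__(self, embed_dim=256, num_tokens=1, depth=4, num_heads=8):
        super().__init__()
        # Learnable initial temporal token, used at the first frame.
        self.init_token = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_tokens = num_tokens

    def forward(self, template_feats, search_feats, track_token=None):
        """Jointly attend over template, search, and the propagated token.

        template_feats: (B, N_t, C) patch embeddings of the reference frame
        search_feats:   (B, N_s, C) patch embeddings of the current frame
        track_token:    (B, num_tokens, C) token propagated from the previous
                        frame, or None at the start of the sequence
        """
        B = search_feats.size(0)
        if track_token is None:
            track_token = self.init_token.expand(B, -1, -1)
        x = torch.cat([track_token, template_feats, search_feats], dim=1)
        x = self.encoder(x)
        # The leading tokens are the updated temporal token; it becomes the
        # prompt for inference on the next frame.
        new_token = x[:, : self.num_tokens]
        refined_search = x[:, self.num_tokens + template_feats.size(1):]
        return refined_search, new_token


# Usage: iterate over a video, propagating the token online frame by frame.
model = TemporalTokenTracker()
template = torch.randn(1, 64, 256)           # reference-frame tokens
token = None                                 # no history at frame 0
for search in torch.randn(10, 1, 256, 256):  # 10 dummy frames, 256 tokens each
    feats, token = model(template, search, track_token=token)
    token = token.detach()                   # stop gradients across frames
    # feats would feed a localization head that predicts the target box
```

Detaching the propagated token between frames keeps memory bounded over arbitrarily long videos, which is one reason this scheme can replace explicit online model-update strategies.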

Published

2024-03-24

How to Cite

Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., & Li, X. (2024). ODTrack: Online Dense Temporal Token Learning for Visual Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7588-7596. https://doi.org/10.1609/aaai.v38i7.28591

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI