MIMTrack: In-Context Tracking via Masked Image Modeling

Authors

  • Xingmei Wang, College of Computer Science and Technology, Harbin Engineering University
  • Guohao Nie, College of Computer Science and Technology, Harbin Engineering University
  • Jiaxiang Meng, College of Computer Science and Technology, Harbin Engineering University
  • Zining Yan, College of Design and Engineering, National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v39i8.32860

Abstract

Current Siamese and Transformer trackers commonly use separate subtask branches, such as regression and classification, to predict object states. Despite their demonstrated success, these branches can introduce location and scale offsets because their respective predictions are produced independently and may be misaligned. To address this, we propose a novel generative tracker, MIMTrack, which defines tracking as a Masked Image Modeling (MIM) process combined with in-context learning (ICL). MIMTrack begins by building a visual prompt image, which consists of a template, a search area, and the two target images associated with them. Each target image transforms the bounding box into the same unified RGB image space as the other tracking images. All state predictions are therefore naturally aligned through the pixel generation of the search target image. In light of this, we perform a MIM process within the visual prompt to reconstruct the masked search target image using the context from the other parts. MIM with ICL makes use of the implicit cross-relations between the template and the search area, and the single-stream generative framework reduces the offset in the estimation. Furthermore, a latent memory module is introduced as a plugin to enhance pixel generation by leveraging the varying appearances of the target over time. The advanced performance observed on leading benchmark datasets highlights the simplicity and effectiveness of our MIMTrack framework.
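
The visual prompt described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the 2x2 layout, the single-channel images (the abstract describes RGB), the `MASK` value, and every function name here are assumptions chosen for illustration, and no model is involved. The sketch only shows how a bounding box can be rendered as a target image, how the search target region of the prompt is left masked, and how location and scale are both read back from the same generated pixels.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # single-channel grid for brevity (assumption)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1 and y1 exclusive

MASK = 127  # hypothetical placeholder value for pixels the model must generate


def box_to_target_image(box: Box, height: int, width: int) -> Image:
    """Render a bounding box as an image: 255 inside the box, 0 outside."""
    x0, y0, x1, y1 = box
    return [
        [255 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


def target_image_to_box(image: Image, threshold: int = 128) -> Optional[Box]:
    """Read a box back from a target image; None if no pixel is bright enough."""
    xs = [x for row in image for x, v in enumerate(row) if v >= threshold]
    ys = [y for y, row in enumerate(image) if any(v >= threshold for v in row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def build_visual_prompt(template: Image, template_box: Box, search: Image) -> Image:
    """Assumed 2x2 layout: [template | template target] over [search | masked target].

    The template and search images are assumed to share the same size.
    """
    height, width = len(template), len(template[0])
    template_target = box_to_target_image(template_box, height, width)
    top = [t_row + tt_row for t_row, tt_row in zip(template, template_target)]
    bottom = [s_row + [MASK] * width for s_row in search]
    return top + bottom


template = [[0] * 8 for _ in range(6)]
search = [[0] * 8 for _ in range(6)]
prompt = build_visual_prompt(template, (1, 1, 4, 3), search)

# Stand-in for the model's output: a perfectly generated search target image.
generated = box_to_target_image((2, 1, 6, 5), 6, 8)
print(target_image_to_box(generated))  # (2, 1, 6, 5)
```

Because the box position and the box size are both derived from one generated image, there is no separate regression or classification output that could disagree with the other, which is the alignment property the abstract attributes to pixel generation.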

Published

2025-04-11

How to Cite

Wang, X., Nie, G., Meng, J., & Yan, Z. (2025). MIMTrack: In-Context Tracking via Masked Image Modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 7979–7987. https://doi.org/10.1609/aaai.v39i8.32860

Section

AAAI Technical Track on Computer Vision VII