Towards Universal Physical Attacks on Single Object Tracking


  • Li Ding Xi'an Jiaotong University, University of British Columbia
  • Yongwei Wang University of British Columbia
  • Kaiwen Yuan University of British Columbia
  • Minyang Jiang University of British Columbia
  • Ping Wang Xi'an Jiaotong University
  • Hua Huang Beijing Normal University
  • Z. Jane Wang University of British Columbia


Adversarial Attacks & Robustness


Recent studies show that small perturbations in video frames can mislead single object trackers. However, such attacks have mainly been designed for digital-domain videos (i.e., perturbations applied to full images), which makes them impractical for evaluating the adversarial vulnerability of trackers in real-world scenarios. Here we take a first step towards physically feasible adversarial attacks against visual tracking in real scenes with a universal patch to camouflage single object trackers. Single object tracking differs fundamentally from physical object detection: its essence lies in feature matching between the search image and templates. We therefore design the maximum textural discrepancy (MTD), a resolution-invariant and target-location-independent feature de-matching loss. The MTD distills global textural information of the template and search images at hierarchical feature scales prior to performing feature attacks. Moreover, we evaluate two shape attacks, regression dilation and regression shrinking, to generate stronger and more controllable attacks. Further, we employ a set of transformations to simulate diverse visual tracking scenes in the wild. Experimental results show the effectiveness of the physically feasible attacks on the SiamMask and SiamRPN++ visual trackers in both digital and physical scenes.
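To make the idea of a resolution-invariant, location-independent texture discrepancy concrete, here is a minimal sketch. It is not the authors' MTD formulation, which the abstract does not specify. It assumes, purely for illustration, that global texture is summarized by a Gram matrix of channel correlations, as is common in texture and style modeling. The function names `gram_matrix` and `texture_discrepancy` and the random arrays standing in for backbone features are all hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """Channel-correlation summary of a (C, H, W) feature map.

    The result is (C, C) whatever H and W are, and it does not change
    when spatial positions are shuffled or shifted. Assumption: this
    stands in for the 'global textural information' in the abstract.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def texture_discrepancy(template_feats, search_feats):
    """Sum of squared Gram-matrix differences over feature levels.

    Each argument is a list of (C, H, W) arrays, one per feature scale.
    An attacker would try to maximize this value (hypothetical objective).
    """
    total = 0.0
    for t, s in zip(template_feats, search_feats):
        diff = gram_matrix(t) - gram_matrix(s)
        total += float(np.sum(diff ** 2))
    return total

# Random arrays stand in for features from a tracker backbone (assumption).
rng = np.random.default_rng(0)
template = [rng.normal(size=(4, 5, 5)), rng.normal(size=(8, 3, 3))]
search = [rng.normal(size=(4, 9, 9)), rng.normal(size=(8, 5, 5))]
print(texture_discrepancy(template, search))
```

Because each Gram matrix has shape (C, C), the template and search features can be compared even though their spatial sizes differ, and a circular shift of the search features leaves the value unchanged. These are the two properties the abstract attributes to the loss.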




How to Cite

Ding, L., Wang, Y., Yuan, K., Jiang, M., Wang, P., Huang, H., & Wang, Z. J. (2021). Towards Universal Physical Attacks on Single Object Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 35(2), 1236-1245.



AAAI Technical Track on Computer Vision I