Aware Distillation for Robust Vision-Language Tracking Under Linguistic Sparsity

Authors

  • Guangtong Zhang — Jilin University; Jilin Engineering Normal University
  • Bineng Zhong — Guangxi Normal University
  • Shirui Yang — Jilin University
  • Yang Wang — Jilin University
  • Tian Bai — Jilin University

DOI:

https://doi.org/10.1609/aaai.v40i15.38237

Abstract

Vision-language object tracking overcomes the limitations of relying solely on visual features by leveraging language descriptions of objects to provide cross-modal semantic information, thereby enhancing model robustness in complex scenarios. However, most existing high-performance vision-language trackers are trained jointly on purely visual data and vision-language multimodal data. Because language annotations are relatively sparse in such data, these trackers tend to prioritize the localization role of visual features, diminishing the model's attention to language information. To mitigate this issue, we propose a novel vision-language tracker: Aware Distillation for Robust Vision-Language Tracking under Linguistic Sparsity (ADTrack). We introduce a knowledge distillation framework employing a knowledge-rich teacher model and a lightweight student model to establish modality correlations between vision and language, enabling efficient modeling between visual information and language descriptions. Specifically, our lightweight student module distills language encoding capabilities from large language models through teacher-guided learning on the input language, while simultaneously performing target-aware perception on template images using language descriptions to generate more effective template features for subsequent visual extraction. Furthermore, to ensure perceptual robustness in linguistically sparse scenarios, we simulate language-deficient conditions during training and employ contrastive learning to enhance model adaptability. Extensive experiments demonstrate that ADTrack reduces parameters by over 50% while achieving state-of-the-art (SOTA) performance and speed on vision-language tracking benchmarks, including LaSOT, LaSOT_ext, TNL2K, OTB-Lang, and MGIT.
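The abstract describes two training ingredients: distilling language encodings from a frozen teacher into a lightweight student, and simulating language-deficient samples with a contrastive objective. The sketch below illustrates one plausible NumPy formulation of those pieces; the loss functions, the whole-embedding dropout scheme, and all names here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def kd_loss(teacher_emb, student_emb):
    # Cosine-distance distillation: pull the student's language embeddings
    # toward the frozen teacher's (hypothetical loss; the abstract does not
    # specify the exact distillation objective).
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * s, axis=-1)))

def info_nce(anchor, positive, temperature=0.07):
    # Batch-wise InfoNCE contrastive loss: each anchor's positive is the
    # matching row of `positive`; the other rows act as in-batch negatives.
    a = anchor / np.linalg.norm(anchor, axis=-1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=-1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def simulate_language_dropout(lang_emb, drop_prob, rng):
    # Zero out whole language embeddings at random to mimic the
    # linguistically sparse samples described in the abstract.
    keep = rng.random(lang_emb.shape[0]) >= drop_prob
    return lang_emb * keep[:, None], keep

rng = np.random.default_rng(0)
B, D = 8, 32
teacher = rng.standard_normal((B, D))
student = teacher + 0.1 * rng.standard_normal((B, D))  # near-aligned student
dropped, keep = simulate_language_dropout(student, 0.3, rng)

# Combined training signal (equal weighting is an arbitrary choice here).
loss = kd_loss(teacher, student) + info_nce(student, teacher)
```

The key design point mirrored here is that the teacher is used only to produce targets: gradients would flow through the student pathway alone, which is what lets the deployed tracker drop the heavy language model at inference time.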

Published

2026-03-14

How to Cite

Zhang, G., Zhong, B., Yang, S., Wang, Y., & Bai, T. (2026). Aware Distillation for Robust Vision-Language Tracking Under Linguistic Sparsity. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12439-12447. https://doi.org/10.1609/aaai.v40i15.38237

Section

AAAI Technical Track on Computer Vision XII