Aware Distillation for Robust Vision-Language Tracking Under Linguistic Sparsity

Authors

  • Guangtong Zhang — Jilin University; Jilin Engineering Normal University
  • Bineng Zhong — Guangxi Normal University
  • Shirui Yang — Jilin University
  • Yang Wang — Jilin University
  • Tian Bai — Jilin University

DOI:

https://doi.org/10.1609/aaai.v40i15.38237

Abstract

Vision-language object tracking overcomes the limitations of relying solely on visual features by leveraging language descriptions of objects to provide cross-modal semantic information, thereby enhancing model robustness in complex scenarios. However, most existing high-performance vision-language trackers are trained jointly on purely visual data and vision-language multimodal data. Because language annotations are relatively sparse in such data, these trackers tend to prioritize the localization role of visual features, diminishing the model's attention to language information. To mitigate this issue, we propose a novel vision-language tracker: Aware Distillation for Robust Vision-Language Tracking under Linguistic Sparsity (ADTrack). We introduce a knowledge distillation framework employing a knowledge-rich teacher model and a lightweight student model to establish modality correlations between vision and language, enabling efficient modeling between visual information and language descriptions. Specifically, our lightweight student module distills language encoding capabilities from large language models through teacher-guided learning on the input language, while simultaneously performing target-aware perception on template images using language descriptions to generate more effective template features for subsequent visual extraction. Furthermore, to ensure perceptual robustness in linguistically sparse scenarios, we simulate language-deficient conditions during training and employ contrastive learning to enhance model adaptability. Extensive experiments demonstrate that ADTrack reduces parameters by over 50% while achieving state-of-the-art (SOTA) performance and speed on vision-language tracking benchmarks, including LaSOT, LaSOT_ext, TNL2K, OTB-Lang, and MGIT.
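The abstract describes two training ingredients: distilling language encodings from a frozen teacher into a lightweight student, and simulating language-deficient samples with a contrastive objective. The sketch below illustrates one plausible NumPy formulation of those pieces; the loss functions, the whole-embedding dropout scheme, and all names here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def kd_loss(teacher_emb, student_emb):
    # Cosine-distance distillation: pull the student's language embeddings
    # toward the frozen teacher's (hypothetical loss; the abstract does not
    # specify the exact distillation objective).
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * s, axis=-1)))

def info_nce(anchor, positive, temperature=0.07):
    # Batch-wise InfoNCE contrastive loss: each anchor's positive is the
    # matching row of `positive`; the other rows act as in-batch negatives.
    a = anchor / np.linalg.norm(anchor, axis=-1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=-1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def simulate_language_dropout(lang_emb, drop_prob, rng):
    # Zero out whole language embeddings at random to mimic the
    # linguistically sparse samples described in the abstract.
    keep = rng.random(lang_emb.shape[0]) >= drop_prob
    return lang_emb * keep[:, None], keep

rng = np.random.default_rng(0)
B, D = 8, 32
teacher = rng.standard_normal((B, D))
student = teacher + 0.1 * rng.standard_normal((B, D))  # near-aligned student
dropped, keep = simulate_language_dropout(student, 0.3, rng)

# Combined training signal (equal weighting is an arbitrary choice here).
loss = kd_loss(teacher, student) + info_nce(student, teacher)
```

The key design point mirrored here is that the teacher is used only to produce targets: gradients would flow through the student pathway alone, which is what lets the deployed tracker drop the heavy language model at inference time.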

Published

2026-03-14

How to Cite

Zhang, G., Zhong, B., Yang, S., Wang, Y., & Bai, T. (2026). Aware Distillation for Robust Vision-Language Tracking Under Linguistic Sparsity. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12439-12447. https://doi.org/10.1609/aaai.v40i15.38237

Section

AAAI Technical Track on Computer Vision XII