Unifying Visual and Vision-Language Tracking via Contrastive Learning

Authors

  • Yinchao Ma Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China
  • Yuyang Tang Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China
  • Wenfei Yang Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China
  • Tianzhu Zhang Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China
  • Jinpeng Zhang Intelligent Science Technology Academy of CASIC
  • Mengxue Kang Intelligent Science Technology Academy of CASIC

DOI:

https://doi.org/10.1609/aaai.v38i5.28205

Keywords:

CV: Motion & Tracking, CV: Language and Vision, CV: Multi-modal Vision

Abstract

Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between modalities, most existing trackers are designed for only one or a subset of these reference settings and overspecialize in a specific modality. In contrast, we present a unified tracker called UVLTrack, which can handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which makes full use of the target reference to dynamically mine ever-changing scenario features from video contexts and distinguishes the target in a contrastive way, enabling robust performance across reference settings. Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Code and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
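The multi-modal contrastive alignment described above can be illustrated with a minimal sketch. The snippet below is not the paper's actual loss; it shows a generic symmetric InfoNCE-style objective (as popularized by CLIP) that pulls matched visual/language feature pairs together and pushes mismatched pairs apart, assuming batched, same-dimensional feature vectors. Function and variable names are illustrative.

```python
import numpy as np

def contrastive_alignment_loss(visual, text, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched
    (visual, text) feature pairs; row i of each matrix is a pair."""
    # L2-normalize features onto the unit hypersphere
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    # Pairwise cosine similarities, sharpened by the temperature
    logits = v @ t.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (matched pair) as target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the visual-to-text and text-to-visual directions
    return 0.5 * (cross_entropy_diag(logits)
                  + cross_entropy_diag(logits.T))
```

With well-aligned features the loss approaches zero, while mismatched pairings yield a large loss; the temperature controls how sharply the softmax concentrates on the hardest negatives.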

Published

2024-03-24

How to Cite

Ma, Y., Tang, Y., Yang, W., Zhang, T., Zhang, J., & Kang, M. (2024). Unifying Visual and Vision-Language Tracking via Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4107-4116. https://doi.org/10.1609/aaai.v38i5.28205

Section

AAAI Technical Track on Computer Vision IV