SAM2-OV: A Novel Detection-Only Tuning Paradigm for Open-Vocabulary Multi-Object Tracking

Authors

  • Yangkai Chen, Xiamen University
  • Qiangqiang Wu, City University of Hong Kong
  • Guangyao Li, Xiamen University
  • Junlong Gao, Xiamen University
  • Guanglin Niu, Beihang University
  • Hanzi Wang, Xiamen University

DOI:

https://doi.org/10.1609/aaai.v40i4.37301

Abstract

Open-vocabulary multi-object tracking (OV-MOT) aims to track objects from categories that are unseen during training. Existing methods rely on pseudo video sequences synthesized from static images and therefore struggle to model realistic motion patterns, which limits their association performance in real-world scenarios. To alleviate these issues, we propose SAM2-OV, a novel association-learning-free OV-MOT method that adopts a detection-only tuning paradigm, eliminating the need for synthetic sequences or spatiotemporal supervision and substantially reducing the number of learnable parameters. The core of our method is a Unified Detection Module (UDM), which supplies the object-level prompts that allow SAM2 to perform OV-MOT. Enabled by UDM, SAM2-OV is the first method to integrate SAM2 into OV-MOT, fully unleashing its zero-shot cross-frame association ability. To further enhance object association under occlusion and abrupt motion, we introduce a Motion Prior Assistance Module (MPAM) that incorporates motion cues into the mask selection process. In addition, a Semantic Enhancement Adapter (SEA) distilled from CLIP is used to improve classification generalization. A sparse prompting strategy is also adopted to reduce computational redundancy by triggering detection only on selected keyframes. Because only the detection module is tuned, and only on static images, the overall training process remains simple and efficient. Experiments on the TAO dataset demonstrate that SAM2-OV achieves state-of-the-art performance under the TETA metric, particularly on novel categories. Evaluations on the KITTI dataset show the strong zero-shot cross-domain transferability of SAM2-OV.
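
The abstract names two mechanisms without giving their details: motion cues that influence mask selection (MPAM) and detection triggered only on selected keyframes. The Python sketch below is an illustration only and is not the authors' implementation. It assumes a constant-velocity motion model, a linear fusion of mask confidence with box overlap, and a fixed keyframe interval. All names (`Candidate`, `predict_box`, `select_candidate`, `keyframes`, `alpha`, `interval`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Candidate:
    """A hypothetical mask candidate, summarized by its box and confidence."""
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict_box(prev: Box, last: Box) -> Box:
    """Assumed constant-velocity prior: repeat the last frame-to-frame shift."""
    return tuple(l + (l - p) for p, l in zip(prev, last))


def select_candidate(cands: List[Candidate], prev: Box, last: Box,
                     alpha: float = 0.5) -> int:
    """Return the index of the candidate with the best fused score.

    The fused score is an assumed linear mix of the candidate's own
    confidence and its IoU with the motion-predicted box.
    """
    pred = predict_box(prev, last)
    fused = [(1 - alpha) * c.score + alpha * iou(c.box, pred) for c in cands]
    return max(range(len(cands)), key=fused.__getitem__)


def keyframes(num_frames: int, interval: int) -> List[int]:
    """Assumed fixed-interval schedule: run the detector every `interval` frames."""
    return [t for t in range(num_frames) if t % interval == 0]


# Example: the object moved 10 px to the right between the last two frames.
cands = [Candidate((0, 0, 10, 10), 0.9), Candidate((20, 0, 30, 10), 0.8)]
best = select_candidate(cands, prev=(0, 0, 10, 10), last=(10, 0, 20, 10))
```

In this example the second candidate wins despite its lower confidence, because it coincides with the motion-predicted location. Setting `alpha=0` ignores the motion prior and falls back to confidence alone.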

Published

2026-03-14

How to Cite

Chen, Y., Wu, Q., Li, G., Gao, J., Niu, G., & Wang, H. (2026). SAM2-OV: A Novel Detection-Only Tuning Paradigm for Open-Vocabulary Multi-Object Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 3083-3091. https://doi.org/10.1609/aaai.v40i4.37301

Section

AAAI Technical Track on Computer Vision I