AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking

Authors

  • Chuanyu Sun Dalian University of Technology
  • Jiqing Zhang Dalian Maritime University
  • Yang Wang Dalian University of Technology
  • Yuanchen Wang Dalian University of Technology
  • Yutong Jiang China Northern Vehicle Research Institute
  • Baocai Yin Beijing University of Technology
  • Xin Yang Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i11.37874

Abstract

Most existing RGB-Event trackers rely on strictly aligned datasets, overlooking the asynchronous spatio-temporal resolutions common in real-world scenarios. This methodological limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance. To overcome this limitation, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. Our TDA framework follows an encode-decode-align paradigm: it first encodes multimodal features to generate target-related priors, which are then progressively decoded to guide a subsequent feature alignment pass. Within this framework, we introduce two key innovations: (1) a Cross-Prior Attention (CPA) module that effectively generates and integrates cross-modal priors, and (2) a Cross-Modal Semantic Alignment (CSA) loss that maximizes mutual information to enforce semantic consistency between modalities. Extensive experiments show that AlignTrack achieves state-of-the-art performance on four challenging RGB-Event tracking benchmarks, demonstrating its robustness in both aligned and unaligned scenarios. Ablation studies further validate the significant contribution of each proposed component.
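The abstract says the CSA loss maximizes mutual information between modalities but does not give its form. As a rough illustration only, the sketch below uses a symmetric InfoNCE contrastive loss, a common lower-bound surrogate for mutual information; it is a swapped-in technique, not the authors' CSA loss. The function name `info_nce_alignment_loss`, the `temperature` parameter, and the assumption that row i of each feature matrix describes the same target are all hypothetical.

```python
import numpy as np


def _log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce_alignment_loss(rgb_feats, event_feats, temperature=0.1):
    """Symmetric InfoNCE loss between paired RGB and event features.

    Hypothetical stand-in for a mutual-information objective; NOT the
    CSA loss from the paper. Both inputs have shape (N, D), and row i
    of each is assumed to describe the same target.
    """
    # L2-normalize so that dot products become cosine similarities.
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    evt = event_feats / np.linalg.norm(event_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs sit on the diagonal.
    logits = rgb @ evt.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy with the diagonal as the positive class, in both directions.
    loss_rgb_to_evt = -_log_softmax(logits)[idx, idx].mean()
    loss_evt_to_rgb = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_rgb_to_evt + loss_evt_to_rgb)


rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))
aligned_evt = rgb + 0.01 * rng.normal(size=(8, 16))  # nearly consistent pairs
shuffled_evt = np.roll(aligned_evt, 1, axis=0)       # deliberately mismatched pairs

loss_aligned = info_nce_alignment_loss(rgb, aligned_evt)
loss_shuffled = info_nce_alignment_loss(rgb, shuffled_evt)
```

Minimizing a loss of this kind pulls matched RGB and event features together and pushes mismatched ones apart, so in this toy setup `loss_aligned` comes out lower than `loss_shuffled`.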

Published

2026-03-14

How to Cite

Sun, C., Zhang, J., Wang, Y., Wang, Y., Jiang, Y., Yin, B., & Yang, X. (2026). AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9171–9179. https://doi.org/10.1609/aaai.v40i11.37874

Issue

Vol. 40 No. 11 (2026)

Section

AAAI Technical Track on Computer Vision VIII