AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking

Chuanyu Sun; Jiqing Zhang; Yang Wang; Yuanchen Wang; Yutong Jiang; Baocai Yin; Xin Yang

doi:10.1609/aaai.v40i11.37874

Authors

Chuanyu Sun Dalian University of Technology
Jiqing Zhang Dalian Maritime University
Yang Wang Dalian University of Technology
Yuanchen Wang Dalian University of Technology
Yutong Jiang China Northern Vehicle Research Institute
Baocai Yin Beijing University of Technology
Xin Yang Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i11.37874

Abstract

Most existing RGB-Event trackers rely on strictly aligned datasets, overlooking the asynchronous spatio-temporal resolutions common in real-world scenarios. This methodological limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance. To overcome this limitation, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. Our TDA framework follows an encode-decode-align paradigm: it first encodes multimodal features to generate target-related priors, which are then progressively decoded to guide a subsequent feature alignment pass. Within this framework, we introduce two key innovations: (1) a Cross-Prior Attention (CPA) module that effectively generates and integrates cross-modal priors, and (2) a Cross-Modal Semantic Alignment (CSA) loss that maximizes mutual information to enforce semantic consistency between modalities. Extensive experiments show that AlignTrack achieves state-of-the-art performance on four challenging RGB-Event tracking benchmarks, demonstrating its robustness in both aligned and unaligned scenarios. Ablation studies further validate the significant contribution of each proposed component.
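
The abstract states that the CSA loss maximizes mutual information between modalities but does not give its form. As a rough illustration only, the sketch below uses the InfoNCE contrastive objective, a standard lower bound on mutual information; it is an assumption swapped in for illustration, not the paper's CSA loss. The function name `info_nce_loss`, the `temperature` parameter, and the feature shapes are all hypothetical.

```python
import numpy as np

def info_nce_loss(rgb_feats, event_feats, temperature=0.1):
    """InfoNCE loss between paired RGB and event feature vectors.

    Illustrative stand-in, NOT the paper's CSA loss.
    rgb_feats, event_feats: arrays of shape (N, D); row i of each
    array is assumed to describe the same target (a positive pair).
    """
    # L2-normalise so the dot product is a cosine similarity.
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    evt = event_feats / np.linalg.norm(event_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = rgb @ evt.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching index as the label.
    return -np.mean(np.diag(log_prob))

# Hypothetical usage with random stand-in features.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))
evt = rgb + 0.01 * rng.normal(size=(8, 16))  # nearly consistent event features
loss = info_nce_loss(rgb, evt)
```

Minimizing this loss raises a lower bound on the mutual information between the two feature sets (the bound is log N minus the loss), which is one common way to encourage cross-modal semantic consistency.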
