Cross-modulated Attention Transformer for RGBT Tracking

Authors

  • Yun Xiao — School of Artificial Intelligence, Anhui University, Hefei, China; Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Hefei, China
  • Jiacong Zhao — School of Artificial Intelligence, Anhui University, Hefei, China
  • Andong Lu — School of Computer Science and Technology, Anhui University, Hefei, China
  • Chenglong Li — School of Artificial Intelligence, Anhui University, Hefei, China; Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Hefei, China
  • Bing Yin — iFLYTEK CO.LTD., Hefei, China
  • Yin Lin — iFLYTEK CO.LTD., Hefei, China
  • Cong Liu — iFLYTEK CO.LTD., Hefei, China

DOI:

https://doi.org/10.1609/aaai.v39i8.32938

Abstract

Existing Transformer-based RGBT trackers achieve remarkable performance by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and search-template correlation. Nevertheless, the independent search-template correlation computation in each modality is prone to corruption by low-quality data, which can produce contradictory and ambiguous correlation weights. This not only limits intra-modal feature representation but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called the Cross-modulated Attention Transformer (CAFormer) for RGBT tracking, which innovatively integrates inter-modality interaction into the search-template correlation computation within a typical attention mechanism. In particular, we first independently generate correlation maps for each modality and feed them into the designed correlation modulated enhancement module, which corrects inaccurate correlation weights by seeking consensus between the modalities. This design unifies the self-attention and cross-attention schemes: it not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we design a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.
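The core idea of the abstract — compute a search-template correlation map per modality, then correct each map using the other modality's map before the attention softmax — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the function names, the blending coefficient `alpha`, and the simple linear blend standing in for the paper's correlation modulated enhancement module are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modulated_attention(q_rgb, k_rgb, v_rgb, q_tir, k_tir, v_tir, alpha=0.5):
    """Sketch of cross-modulated attention for two modalities (RGB and thermal).

    Each modality's search-template correlation map is blended with the other
    modality's map (a simple stand-in for the consensus-seeking modulation
    module described in the abstract) before the softmax, so that unreliable
    weights in one modality can be corrected by the other.
    q_*: (N, d) search tokens; k_*, v_*: (M, d) template tokens.
    """
    d = q_rgb.shape[-1]
    # Independent per-modality correlation maps (scaled dot-product).
    corr_rgb = q_rgb @ k_rgb.T / np.sqrt(d)
    corr_tir = q_tir @ k_tir.T / np.sqrt(d)
    # Modulate each map toward the cross-modal consensus; `alpha` is an
    # assumed hyperparameter, not taken from the paper.
    mod_rgb = (1 - alpha) * corr_rgb + alpha * corr_tir
    mod_tir = (1 - alpha) * corr_tir + alpha * corr_rgb
    # Standard attention readout with the modulated weights.
    out_rgb = softmax(mod_rgb) @ v_rgb
    out_tir = softmax(mod_tir) @ v_tir
    return out_rgb, out_tir
```

Because the modulation happens inside a single attention computation, no separate cross-attention pass over the two token sets is needed, which is the redundancy the abstract says the unified scheme removes.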

Published

2025-04-11

How to Cite

Xiao, Y., Zhao, J., Lu, A., Li, C., Yin, B., Lin, Y., & Liu, C. (2025). Cross-modulated Attention Transformer for RGBT Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8682-8690. https://doi.org/10.1609/aaai.v39i8.32938

Section

AAAI Technical Track on Computer Vision VII