Exploiting All Mamba Fusion for Efficient RGB-D Tracking

Authors

  • Ge Ying: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University; School of Computer Science and Technology, Zhejiang Normal University
  • Dawei Zhang: School of Computer Science and Technology, Zhejiang Normal University; College of Computer Science and Technology, Zhejiang University
  • Chengzhuan Yang: School of Computer Science and Technology, Zhejiang Normal University
  • Wei Liu: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
  • Sang-Woon Jeon: School of Computer Science and Technology, Zhejiang Normal University
  • Hua Wang: Institute for Sustainable Industries and Liveable Cities, College of Engineering and Science, Victoria University
  • Changqin Huang: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Zhonglong Zheng: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University; School of Computer Science and Technology, Zhejiang Normal University; China-Mozambique “Belt and Road” Joint Laboratory on Smart Agriculture, Zhejiang Normal University

DOI:

https://doi.org/10.1609/aaai.v40i14.38195

Abstract

Despite progress driven by deep learning, existing Visual Object Tracking (VOT) frameworks still struggle with real-world challenges. Recent approaches incorporate additional modalities such as Depth, Thermal Infrared, and Language to improve robustness; in particular, advances in depth-sensor precision have made RGB-D tracking increasingly practical. However, current RGB-D trackers often copy RGB tracking paradigms, which makes them inefficient: their two-stream architectures fail to exploit heterogeneous features, and they rely on fusion methods that are either simplistic or parameter-heavy. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker that leverages Mamba's linear complexity for simultaneous feature extraction and two-stage cross-modal feature fusion. We further introduce a low-parameter Multimodal Mix Mamba (3M) module, which improves deep feature fusion while reducing computational overhead. The advantage of the 3M module stems from our Multimodal State Space Model (MSSM), a multimodal feature-interaction component reconstructed from the SSM. Experiments on multiple RGB-D tracking datasets show that AMTrack achieves superior performance with fewer parameters and lower memory demands than state-of-the-art methods.
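
To make the fusion idea concrete, the sketch below shows a selective-SSM-style scan over interleaved RGB and depth tokens, the general mechanism the abstract describes: a linear recurrence with input-dependent parameters mixes the two modalities at every step, at cost linear in sequence length. This is a minimal illustration under stated assumptions, not the paper's MSSM or 3M implementation; the class name ToySSMFusion, the interleaving scheme, and all dimension choices are hypothetical stand-ins.

```python
# Illustrative sketch only: NOT the authors' MSSM/3M code. All names and
# parameter choices here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMFusion(nn.Module):
    """Scans an interleaved RGB/depth token sequence with a diagonal
    state-space recurrence, so each modality's state conditions the other."""
    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        # Input-dependent (selective) SSM parameters, in the style of Mamba.
        self.to_dt = nn.Linear(dim, 1)
        self.to_B = nn.Linear(dim, state_dim)
        self.to_C = nn.Linear(dim, state_dim)
        self.A_log = nn.Parameter(torch.zeros(state_dim))  # A = -exp(A_log) < 0
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (batch, length, dim). Interleave tokens as
        # (rgb_1, depth_1, rgb_2, depth_2, ...) so one linear scan fuses them.
        Bt, L, D = rgb.shape
        x = torch.stack((rgb, depth), dim=2).reshape(Bt, 2 * L, D)
        x = self.in_proj(x)
        dt = F.softplus(self.to_dt(x))           # (B, 2L, 1) step sizes
        A = -torch.exp(self.A_log)               # (state_dim,) stable decay
        decay = torch.exp(dt * A)                # (B, 2L, state_dim)
        Bm, Cm = self.to_B(x), self.to_C(x)      # (B, 2L, state_dim)
        h = x.new_zeros(Bt, D, self.A_log.numel())
        ys = []
        for t in range(2 * L):
            # Simplified ZOH-style update: h_t = exp(dt*A)*h_{t-1} + dt*B_t*x_t
            h = decay[:, t, None, :] * h \
                + (dt[:, t, None, :] * Bm[:, t, None, :]) * x[:, t, :, None]
            ys.append((h * Cm[:, t, None, :]).sum(-1))   # y_t = C_t · h_t
        y = self.out_proj(torch.stack(ys, dim=1))        # (B, 2L, D)
        # De-interleave and average each RGB/depth pair into one fused token.
        return y.reshape(Bt, L, 2, D).mean(2)

# Usage: fuse 64 RGB tokens with 64 depth tokens of width 256.
fused = ToySSMFusion(256)(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(fused.shape)  # torch.Size([2, 64, 256])
```

The interleaving is the point of the sketch: because the recurrent state carries information forward through the scan, each depth token is conditioned on all preceding RGB tokens and vice versa, giving cross-modal interaction without the quadratic cost of cross-attention.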

Published

2026-03-14

How to Cite

Ying, G., Zhang, D., Yang, C., Liu, W., Jeon, S.-W., Wang, H., … Zheng, Z. (2026). Exploiting All Mamba Fusion for Efficient RGB-D Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12063–12071. https://doi.org/10.1609/aaai.v40i14.38195

Issue

Vol. 40 No. 14 (2026)

Section

AAAI Technical Track on Computer Vision XI