Exploiting All Mamba Fusion for Efficient RGB-D Tracking

Authors

  • Ge Ying: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University; School of Computer Science and Technology, Zhejiang Normal University
  • Dawei Zhang: School of Computer Science and Technology, Zhejiang Normal University; College of Computer Science and Technology, Zhejiang University
  • Chengzhuan Yang: School of Computer Science and Technology, Zhejiang Normal University
  • Wei Liu: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
  • Sang-Woon Jeon: School of Computer Science and Technology, Zhejiang Normal University
  • Hua Wang: Institute for Sustainable Industries and Liveable Cities, College of Engineering and Science, Victoria University
  • Changqin Huang: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Zhonglong Zheng: Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University; School of Computer Science and Technology, Zhejiang Normal University; China-Mozambique “Belt and Road” Joint Laboratory on Smart Agriculture, Zhejiang Normal University

DOI:

https://doi.org/10.1609/aaai.v40i14.38195

Abstract

Despite progress driven by deep learning, existing Visual Object Tracking (VOT) frameworks still struggle with real-world challenges. Recent approaches incorporate additional modalities such as Depth, Thermal Infrared, and Language to improve robustness; in particular, advances in depth-sensor precision have made RGB-D tracking increasingly practical. However, current RGB-D trackers often copy RGB tracking paradigms, which makes them inefficient: their two-stream architectures fail to exploit heterogeneous features, and they rely on fusion methods that are either simplistic or parameter-heavy. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker that leverages Mamba's linear complexity for simultaneous feature extraction and two-stage cross-modal feature fusion. We further introduce a low-parameter Multimodal Mix Mamba (3M) module, which improves deep feature fusion while reducing computational overhead. The advantage of the 3M module stems from our Multimodal State Space Model (MSSM), a multimodal feature-interaction component reconstructed from the SSM. Experiments on multiple RGB-D tracking datasets show that AMTrack achieves superior performance with fewer parameters and lower memory demands than state-of-the-art methods.
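
To make the fusion idea concrete, the sketch below shows a selective-SSM-style scan over interleaved RGB and depth tokens, the general mechanism the abstract describes: a linear recurrence with input-dependent parameters mixes the two modalities at every step, at cost linear in sequence length. This is a minimal illustration under stated assumptions, not the paper's MSSM or 3M implementation; the class name ToySSMFusion, the interleaving scheme, and all dimension choices are hypothetical stand-ins.

```python
# Illustrative sketch only: NOT the authors' MSSM/3M code. All names and
# parameter choices here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMFusion(nn.Module):
    """Scans an interleaved RGB/depth token sequence with a diagonal
    state-space recurrence, so each modality's state conditions the other."""
    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        # Input-dependent (selective) SSM parameters, in the style of Mamba.
        self.to_dt = nn.Linear(dim, 1)
        self.to_B = nn.Linear(dim, state_dim)
        self.to_C = nn.Linear(dim, state_dim)
        self.A_log = nn.Parameter(torch.zeros(state_dim))  # A = -exp(A_log) < 0
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (batch, length, dim). Interleave tokens as
        # (rgb_1, depth_1, rgb_2, depth_2, ...) so one linear scan fuses them.
        Bt, L, D = rgb.shape
        x = torch.stack((rgb, depth), dim=2).reshape(Bt, 2 * L, D)
        x = self.in_proj(x)
        dt = F.softplus(self.to_dt(x))           # (B, 2L, 1) step sizes
        A = -torch.exp(self.A_log)               # (state_dim,) stable decay
        decay = torch.exp(dt * A)                # (B, 2L, state_dim)
        Bm, Cm = self.to_B(x), self.to_C(x)      # (B, 2L, state_dim)
        h = x.new_zeros(Bt, D, self.A_log.numel())
        ys = []
        for t in range(2 * L):
            # Simplified ZOH-style update: h_t = exp(dt*A)*h_{t-1} + dt*B_t*x_t
            h = decay[:, t, None, :] * h \
                + (dt[:, t, None, :] * Bm[:, t, None, :]) * x[:, t, :, None]
            ys.append((h * Cm[:, t, None, :]).sum(-1))   # y_t = C_t · h_t
        y = self.out_proj(torch.stack(ys, dim=1))        # (B, 2L, D)
        # De-interleave and average each RGB/depth pair into one fused token.
        return y.reshape(Bt, L, 2, D).mean(2)

# Usage: fuse 64 RGB tokens with 64 depth tokens of width 256.
fused = ToySSMFusion(256)(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(fused.shape)  # torch.Size([2, 64, 256])
```

The interleaving is the point of the sketch: because the recurrent state carries information forward through the scan, each depth token is conditioned on all preceding RGB tokens and vice versa, giving cross-modal interaction without the quadratic cost of cross-attention.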

Published

2026-03-14

How to Cite

Ying, G., Zhang, D., Yang, C., Liu, W., Jeon, S.-W., Wang, H., … Zheng, Z. (2026). Exploiting All Mamba Fusion for Efficient RGB-D Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12063–12071. https://doi.org/10.1609/aaai.v40i14.38195

Issue

Vol. 40 No. 14 (2026)

Section

AAAI Technical Track on Computer Vision XI