Exploiting All Mamba Fusion for Efficient RGB-D Tracking
DOI:
https://doi.org/10.1609/aaai.v40i14.38195
Abstract
Despite the progress made through deep learning, existing Visual Object Tracking (VOT) frameworks struggle with real-world challenges. Recent approaches incorporate additional modalities such as depth, thermal infrared, and language to enhance the robustness of VOT; in particular, improvements in depth-sensor precision have facilitated RGB-D tracking. However, current RGB-D trackers often copy RGB tracking paradigms, leading to inefficiency: their two-stream architectures fail to exploit heterogeneous features, and their fusion relies on either simplistic or parameter-heavy methods. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker that leverages Mamba's linear complexity for simultaneous feature extraction and two-stage cross-modal feature fusion. Our contribution also includes a low-parameter Multimodal Mix Mamba (3M) module, which optimizes deep feature fusion and reduces computational overhead. The advantage of the 3M module stems from our Multimodal State Space Model (MSSM), a multimodal feature-interaction component reconstructed from the SSM. Experiments across multiple RGB-D tracking datasets show that AMTrack achieves superior performance with fewer parameters and lower memory demands than state-of-the-art trackers.
Published
2026-03-14
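The linear complexity the abstract attributes to Mamba comes from the underlying state space recurrence, which processes a sequence in a single left-to-right scan. The sketch below is a minimal, illustrative SSM scan with fixed parameters, not the paper's actual 3M/MSSM implementation (Mamba additionally makes A, B, C input-dependent, i.e. "selective"):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential state space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    One pass over the sequence, so cost grows linearly with length L,
    unlike the quadratic cost of self-attention. A, B, C are fixed here;
    Mamba derives them from the input at each step.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:               # single left-to-right pass: O(L)
        h = A @ h + B * x_t     # state update
        ys.append(C @ h)        # readout
    return np.array(ys)

# Toy 1-D input sequence with a 4-dimensional hidden state.
L, d_state = 8, 4
x = np.linspace(0.0, 1.0, L)
A = 0.9 * np.eye(d_state)       # stable decaying dynamics
B = np.ones(d_state)
C = np.ones(d_state) / d_state
y = ssm_scan(x, A, B, C)
print(y.shape)  # (8,)
```

Because the state `h` summarizes all past inputs, the per-step cost is constant in sequence length, which is what makes a one-stream tracker over long RGB-D token sequences tractable.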
How to Cite
Ying, G., Zhang, D., Yang, C., Liu, W., Jeon, S.-W., Wang, H., … Zheng, Z. (2026). Exploiting All Mamba Fusion for Efficient RGB-D Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12063–12071. https://doi.org/10.1609/aaai.v40i14.38195
Section
AAAI Technical Track on Computer Vision XI