H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Authors

  • Siran Chen — Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Yuxiao Luo — Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; The Hong Kong Polytechnic University
  • Yue Ma — The Hong Kong University of Science and Technology
  • Yu Qiao — Shanghai Artificial Intelligence Laboratory; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
  • Yali Wang — Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory

DOI:

https://doi.org/10.1609/aaai.v39i2.32220

Abstract

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate video contexts across multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method by 5.5% mIoU.
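The two-stage design described in the abstract can be illustrated with a toy sketch. This is an assumption-laden illustration, not the authors' implementation: "C-Mamba" is approximated here by simple causal averages over several temporal strides (standing in for the structured state space models), and "Q-Mamba" by a softmax-weighted fusion of those multi-granularity contexts driven by the current-frame query. All names and shapes are hypothetical.

```python
import math

def c_mamba(frames, strides=(1, 2, 4)):
    """Multi-granularity context: one summary vector per temporal stride.

    frames: list of per-frame feature vectors (lists of floats).
    Each stride subsamples the sequence to a coarser temporal resolution,
    then averages it into a single context vector (a stand-in for an SSM scan).
    """
    contexts = []
    for s in strides:
        sub = frames[::s]  # coarser temporal resolution
        contexts.append([sum(col) / len(sub) for col in zip(*sub)])
    return contexts

def q_mamba(query, contexts):
    """Attentively fuse multi-granularity contexts into the current-frame query."""
    # Relevance of each context to the query (dot product).
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in contexts]
    # Softmax over contexts.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of contexts, added residually to the query.
    fused = [sum(w * ctx[d] for w, ctx in zip(weights, contexts))
             for d in range(len(query))]
    return [q + f for q, f in zip(query, fused)]

# Usage: 8 frames of 2-d features; enhance the latest frame with video context.
frames = [[0.1 * t, 0.2 * t] for t in range(8)]
enhanced = q_mamba(frames[-1], c_mamba(frames))
```

In the paper's setting, the averaged summaries would instead be Mamba state space scans and the fusion would operate on visual tokens inside the MLLM, but the flow — multi-resolution context extraction followed by query-driven selection — matches the abstract's description.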

Published

2025-04-11

How to Cite

Chen, S., Luo, Y., Ma, Y., Qiao, Y., & Wang, Y. (2025). H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 2212–2220. https://doi.org/10.1609/aaai.v39i2.32220

Section

AAAI Technical Track on Computer Vision I