MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

Authors

  • Haoran Tang (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Peng Cheng Laboratory)
  • Meng Cao (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University)
  • Jinfa Huang (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University)
  • Ruyang Liu (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Peng Cheng Laboratory)
  • Peng Jin (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Peng Cheng Laboratory)
  • Ge Li (Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University)
  • Xiaodan Liang (Sun Yat-sen University)

DOI:

https://doi.org/10.1609/aaai.v39i7.32778

Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to CLIP's inherently plain structure, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
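The abstract describes two steps: building a feature pyramid from the final single-scale feature map, and feeding the flattened multi-scale tokens into a linear-complexity sequence learner. The sketch below illustrates that data flow in PyTorch under stated assumptions: the pooling scales, module names, and the cumulative-mean scan (a simple O(N) stand-in for an actual Mamba selective-scan block) are all hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSequenceSketch(nn.Module):
    """Hypothetical sketch: feature pyramid over the last feature map,
    flattened into one joint sequence for a linear-time scanner."""

    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales  # assumed downsampling factors, not from the paper
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, C, H, W) -- the last single-scale feature map
        B, C, H, W = x.shape
        tokens = []
        for s in self.scales:
            # pyramid level: average-pool to a coarser resolution
            f = F.adaptive_avg_pool2d(x, (H // s, W // s))
            # flatten spatial grid into tokens: (B, Hs*Ws, C)
            tokens.append(f.flatten(2).transpose(1, 2))
        # concatenate all scales into one joint sequence
        seq = torch.cat(tokens, dim=1)
        # cumulative-mean scan: an O(N) placeholder standing in for
        # the Mamba block's linear-complexity selective scan
        counts = torch.arange(1, seq.size(1) + 1, device=seq.device)
        return torch.cumsum(self.proj(seq), dim=1) / counts.view(1, -1, 1)


# usage: an 8x8 map at scales (1, 2, 4) yields 64 + 16 + 4 = 84 tokens
model = MultiScaleSequenceSketch(dim=8)
out = model(torch.randn(2, 8, 8, 8))
print(out.shape)  # torch.Size([2, 84, 8])
```

The point of the placeholder scan is only to show why a recurrent-style operator is attractive here: the joint multi-scale sequence grows with the number of pyramid levels, so linear rather than quadratic cost in sequence length matters.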

Published

2025-04-11

How to Cite

Tang, H., Cao, M., Huang, J., Liu, R., Jin, P., Li, G., & Liang, X. (2025). MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7238–7246. https://doi.org/10.1609/aaai.v39i7.32778

Section

AAAI Technical Track on Computer Vision VI