Efficient Self-Supervised Video Hashing with Selective State Spaces

Authors

  • Jinpeng Wang (Tsinghua Shenzhen International Graduate School, Tsinghua University)
  • Niu Lian (Harbin Institute of Technology, Shenzhen)
  • Jun Li (Harbin Institute of Technology, Shenzhen)
  • Yuting Wang (Tsinghua Shenzhen International Graduate School, Tsinghua University)
  • Yan Feng (Meituan, Beijing)
  • Bin Chen (Harbin Institute of Technology, Shenzhen; Research Center of Artificial Intelligence, Peng Cheng Laboratory)
  • Yongbing Zhang (Harbin Institute of Technology, Shenzhen)
  • Shu-Tao Xia (Tsinghua Shenzhen International Graduate School, Tsinghua University; Research Center of Artificial Intelligence, Peng Cheng Laboratory)

DOI:

https://doi.org/10.1609/aaai.v39i7.32835

Abstract

Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency.
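
As a rough illustration of the ideas named in the abstract, the sketch below shows how binary hash codes support retrieval by Hamming distance and what a center alignment term could look like. The abstract does not give the loss formulation, so the binary cross-entropy form used here is an assumption, and every function name (binarize, hamming, center_alignment_loss, rank_by_hamming) is hypothetical rather than taken from S5VH.

```python
import math


def binarize(relaxed):
    """Map real-valued (relaxed) code entries to {-1, +1} by sign."""
    return [1 if v >= 0 else -1 for v in relaxed]


def hamming(a, b):
    """Count the positions where two {-1, +1} codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def center_alignment_loss(relaxed, center):
    """Assumed alignment term (not the paper's stated formula): binary
    cross-entropy between a relaxed code with entries in (-1, 1) and its
    assigned hash center in {-1, +1}, averaged over bits."""
    eps = 1e-7
    total = 0.0
    for v, c in zip(relaxed, center):
        p = min(max((v + 1) / 2, eps), 1 - eps)  # rescale code entry to (0, 1)
        t = (c + 1) / 2                          # rescale center bit to {0, 1}
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(center)


def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to the query code."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
```

Under these assumptions, a relaxed code whose signs match its center yields a lower loss than one whose signs are flipped, and retrieval reduces to sorting database codes by Hamming distance to the query code.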

Published

2025-04-11

How to Cite

Wang, J., Lian, N., Li, J., Wang, Y., Feng, Y., Chen, B., … Xia, S.-T. (2025). Efficient Self-Supervised Video Hashing with Selective State Spaces. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7753–7761. https://doi.org/10.1609/aaai.v39i7.32835

Section

AAAI Technical Track on Computer Vision VI