State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Geewook Kim; Minjoon Seo

doi:10.1609/aaai.v40i7.37485

Authors

Geewook Kim (NAVER Cloud AI; KAIST AI)
Minjoon Seo (KAIST AI)

DOI:

https://doi.org/10.1609/aaai.v40i7.37485

Abstract

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models while significantly reducing the overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
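
The abstract names three ingredients: a bidirectional state-space mixer, a gated skip connection, and learnable weighted-average pooling over periodically inserted learned queries. The toy sketch below shows one way these pieces could fit together; it is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`scan`, `bidirectional_scan`, `compress`), the parameters (`period`, `decay`, `gate_logit`, `pool_logits`), the use of a fixed scalar decay in place of a learned state-space layer, and the choice to keep only the query positions after mixing.

```python
import math

def scan(xs, decay):
    # Per-channel linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    # A stand-in for a state-space layer (assumption: fixed scalar decay).
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out

def bidirectional_scan(xs, decay):
    # Run the recurrence left-to-right and right-to-left, then average.
    fwd = scan(xs, decay)
    bwd = scan(xs[::-1], decay)[::-1]
    return [[0.5 * (f + b) for f, b in zip(fr, br)] for fr, br in zip(fwd, bwd)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compress(tokens, query, period, decay, gate_logit, pool_logits):
    # 1) Insert the (hypothetical) learned query after every `period` tokens.
    seq, query_pos = [], []
    for i, tok in enumerate(tokens, start=1):
        seq.append(tok)
        if i % period == 0:
            query_pos.append(len(seq))  # index the query is about to occupy
            seq.append(query)
    # 2) Mix information across the whole sequence in both directions.
    mixed = bidirectional_scan(seq, decay)
    # 3) Gated skip connection: blend mixer output with its input.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    gated = [[g * m + (1.0 - g) * x for m, x in zip(mr, xr)]
             for mr, xr in zip(mixed, seq)]
    # 4) Keep only the query positions as summaries of their neighborhoods.
    summaries = [gated[p] for p in query_pos]
    # 5) Weighted-average pooling over consecutive groups of summaries,
    #    with softmax-normalized weights standing in for learned ones.
    w = softmax(pool_logits)
    k = len(w)
    pooled = []
    for s in range(0, len(summaries) - k + 1, k):
        group = summaries[s:s + k]
        pooled.append([sum(wi * vec[c] for wi, vec in zip(w, group))
                       for c in range(len(query))])
    return pooled

# 12 input tokens -> 4 query summaries -> 2 pooled vectors.
tokens = [[float(t), 1.0] for t in range(12)]
out = compress(tokens, query=[0.0, 0.0], period=3, decay=0.9,
               gate_logit=0.0, pool_logits=[0.0, 0.0])
print(len(out), len(out[0]))
```

Applying `compress` again to its own output would give a second, coarser level, which is one reading of the hierarchical downsampling the abstract describes.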

Published

2026-03-14

How to Cite

Kim, G., & Seo, M. (2026). State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5656–5664. https://doi.org/10.1609/aaai.v40i7.37485

Issue

Vol. 40 No. 7: AAAI-26 Technical Tracks 7

Section

AAAI Technical Track on Computer Vision IV
