OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
DOI: https://doi.org/10.1609/aaai.v40i24.39087

Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training–inference gap and lack the capacity for fine-grained token selection across multiple dimensions, such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware, fine-grained sparse attention framework for long-video MLLMs that is applied in both training and inference with dynamic token budget allocation. Specifically, OmniSparse comprises three adaptive and complementary mechanisms: (1) query selection via lazy–active classification, which retains active queries that capture broad semantic similarity while discarding most lazy queries, which attend to limited local context and are highly redundant with their neighbors; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined by the flattest head and applied uniformly across all heads to ensure attention recall after selection; and (3) KV cache slimming to reduce head-level redundancy, which selectively fetches the visual KV cache according to each head's decoding query pattern. Experimental results demonstrate that OmniSparse achieves performance comparable to full attention while delivering a 2.7x prefill speedup and a 2.4x memory reduction during decoding.
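The head-level budget mechanism described in (2) can be illustrated with a small sketch. This is a hypothetical reconstruction based only on the abstract, not the authors' implementation: for each head, we count how many top-scoring KV tokens are needed to retain a target fraction of that head's attention mass; the flattest head (the one needing the most tokens) sets a shared budget that is then applied uniformly, so every head meets the recall target. The function names, the `recall` parameter, and the NumPy formulation are all assumptions for illustration.

```python
import numpy as np

def shared_kv_budget(attn, recall=0.85):
    """Hypothetical sketch of head-level dynamic budget allocation.

    attn: [num_heads, num_kv] attention scores, each row summing to 1.
    Returns the number of top-scoring KV tokens the flattest head needs
    to retain `recall` of its attention mass; since flatter heads need
    more tokens, this shared budget guarantees recall for all heads.
    """
    num_heads, _ = attn.shape
    budget = 0
    for h in range(num_heads):
        sorted_scores = np.sort(attn[h])[::-1]        # descending scores
        cum = np.cumsum(sorted_scores)                # cumulative attention mass
        need = int(np.searchsorted(cum, recall) + 1)  # tokens to reach recall
        budget = max(budget, need)                    # flattest head dominates
    return budget

def select_kv(attn, budget):
    """Per-head indices of the top-`budget` KV tokens (uniform budget)."""
    return np.argsort(attn, axis=1)[:, ::-1][:, :budget]
```

For example, with one peaked head and one flat head over ten KV tokens, the flat head dictates the shared budget, and the peaked head simply keeps more tokens than it strictly needs; this trades a slightly larger budget for guaranteed attention recall on every head.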
Published
2026-03-14
How to Cite
Chen, F., He, Y., He, S., He, Y., Liu, J., Lin, L., Liu, A., Li, Z., Zhang, J., Sun, Z., Zhuang, B., & Wu, Q. (2026). OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20023-20031. https://doi.org/10.1609/aaai.v40i24.39087
Section
AAAI Technical Track on Machine Learning I