OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Authors

  • Feng Chen AIML, The University of Adelaide
  • Yefei He Zhejiang University
  • Shaoxuan He Zhejiang University
  • Yuanyu He Zhejiang University
  • Jing Liu Monash University
  • Lequan Lin The University of Sydney
  • Akide Liu Monash University
  • Zhaoyang Li TikTok
  • Jiyuan Zhang TikTok
  • Zhenbang Sun TikTok
  • Bohan Zhuang Zhejiang University
  • Qi Wu AIML, The University of Adelaide

DOI:

https://doi.org/10.1609/aaai.v40i24.39087

Abstract

Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training–inference gap and lack the capacity for fine-grained token selection across multiple dimensions, such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs that is applied in both training and inference with dynamic token budget allocation. Specifically, OmniSparse comprises three adaptive and complementary mechanisms: (1) query selection via lazy–active classification, which retains active queries that capture broad semantic similarity while discarding most lazy ones, which attend only to limited local context and are largely functionally redundant with their neighbors; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined by the flattest head and applied uniformly across all heads to ensure attention recall after selection; and (3) KV cache slimming, which alleviates head-level redundancy by selectively fetching the visual KV cache according to each head's decoding-query pattern. Experimental results demonstrate that OmniSparse matches the performance of full attention while achieving up to a 2.7x prefill speedup and a 2.4x memory reduction during decoding.
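To make mechanism (2) concrete, the following is a minimal sketch of head-level dynamic budget allocation, assuming the intuition stated in the abstract: the head with the flattest attention distribution needs the most keys to reach a target attention recall, so its budget is shared across all heads. The function name `shared_kv_budget`, the per-head aggregation over queries, and the recall threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def shared_kv_budget(attn, recall=0.9):
    """Pick a shared per-head KV budget from the flattest attention head.

    attn: array of shape [heads, queries, keys] holding attention weights
          (each query row sums to 1). Illustrative stand-in for the
    statistics OmniSparse would compute during training/inference.
    """
    # Aggregate key importance per head across queries.
    key_mass = attn.mean(axis=1)                                # [H, K]
    # Sort keys by importance within each head (descending).
    order = np.argsort(-key_mass, axis=-1)                      # [H, K]
    sorted_mass = np.take_along_axis(key_mass, order, axis=-1)
    # Cumulative fraction of attention mass covered by the top-k keys.
    cum = np.cumsum(sorted_mass, axis=-1) / sorted_mass.sum(axis=-1, keepdims=True)
    # Smallest k per head that reaches the target recall.
    budgets = (cum < recall).sum(axis=-1) + 1                   # [H]
    # The flattest head dominates; apply its budget uniformly.
    shared = int(budgets.max())
    keep = order[:, :shared]                                    # kept key indices per head
    return shared, keep
```

With one sharply peaked head and one flat head, the flat head's larger requirement sets the budget for both, which is exactly why a per-head (rather than global) statistic is needed to preserve attention recall after selection.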

Published

2026-03-14

How to Cite

Chen, F., He, Y., He, S., He, Y., Liu, J., Lin, L., Liu, A., Li, Z., Zhang, J., Sun, Z., Zhuang, B., & Wu, Q. (2026). OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20023-20031. https://doi.org/10.1609/aaai.v40i24.39087

Section

AAAI Technical Track on Machine Learning I