OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
DOI: https://doi.org/10.1609/aaai.v40i24.39087

Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training–inference gap and lack the capacity for fine-grained token selection across multiple dimensions, such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware, fine-grained sparse attention framework for long-video MLLMs that is applied in both training and inference with dynamic token budget allocation. Specifically, OmniSparse comprises three adaptive and complementary mechanisms: (1) query selection via lazy–active classification, which retains active queries that capture broad semantic similarity while discarding most lazy queries, which attend to limited local context and are highly redundant with their neighbors; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined by the flattest head and applied uniformly across all heads to ensure attention recall after selection; and (3) KV cache slimming to reduce head-level redundancy, which selectively fetches the visual KV cache according to each head's decoding query pattern. Experimental results demonstrate that OmniSparse achieves performance comparable to full attention while delivering a 2.7x prefill speedup and a 2.4x memory reduction during decoding.
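The head-level budget mechanism described in (2) can be illustrated with a small sketch. This is a hypothetical reconstruction based only on the abstract, not the authors' implementation: for each head, we count how many top-scoring KV tokens are needed to retain a target fraction of that head's attention mass; the flattest head (the one needing the most tokens) sets a shared budget that is then applied uniformly, so every head meets the recall target. The function names, the `recall` parameter, and the NumPy formulation are all assumptions for illustration.

```python
import numpy as np

def shared_kv_budget(attn, recall=0.85):
    """Hypothetical sketch of head-level dynamic budget allocation.

    attn: [num_heads, num_kv] attention scores, each row summing to 1.
    Returns the number of top-scoring KV tokens the flattest head needs
    to retain `recall` of its attention mass; since flatter heads need
    more tokens, this shared budget guarantees recall for all heads.
    """
    num_heads, _ = attn.shape
    budget = 0
    for h in range(num_heads):
        sorted_scores = np.sort(attn[h])[::-1]        # descending scores
        cum = np.cumsum(sorted_scores)                # cumulative attention mass
        need = int(np.searchsorted(cum, recall) + 1)  # tokens to reach recall
        budget = max(budget, need)                    # flattest head dominates
    return budget

def select_kv(attn, budget):
    """Per-head indices of the top-`budget` KV tokens (uniform budget)."""
    return np.argsort(attn, axis=1)[:, ::-1][:, :budget]
```

For example, with one peaked head and one flat head over ten KV tokens, the flat head dictates the shared budget, and the peaked head simply keeps more tokens than it strictly needs; this trades a slightly larger budget for guaranteed attention recall on every head.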
Published
2026-03-14
How to Cite
Chen, F., He, Y., He, S., He, Y., Liu, J., Lin, L., Liu, A., Li, Z., Zhang, J., Sun, Z., Zhuang, B., & Wu, Q. (2026). OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20023-20031. https://doi.org/10.1609/aaai.v40i24.39087
Section
AAAI Technical Track on Machine Learning I