Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Authors

  • Mingkuan Zhao Xi'an Jiaotong University
  • Wentao Hu Xi'an Jiaotong University
  • Jiayin Wang Xi'an Jiaotong University
  • Xin Lai Xi'an Jiaotong University
  • Tianchen Huang University of Science and Technology of China
  • Yuheng Min Tsinghua University
  • Rui Yan University of California, San Diego
  • Xiaoyan Zhu Xi'an Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i41.40800

Abstract

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity comes at a computational cost of O(H·N²), which grows quadratically with the context size (N) and linearly with the number of heads (H). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is a new paradigm we term Principled Structural Sparsity. Rather than merely dropping connections, SPAttention reorganizes the computation by partitioning the total attention workload into balanced, non-overlapping distance bands and assigning each head a unique segment. This transforms multi-head attention from H independent O(N²) computations into a single, collaborative O(N²) computation, fundamentally reducing complexity by a factor of H. The structured inductive bias compels functional specialization among heads, reallocating computation from redundant modeling toward distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series shows that SPAttention delivers an approximately two-fold increase in training throughput while matching standard dense attention in performance, even surpassing it on select key metrics, and consistently outperforms representative sparse attention methods, including Longformer, Reformer, and BigBird, across all evaluation metrics. Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.
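
To make the banding scheme concrete, below is a minimal PyTorch sketch of how per-head distance-band masks might be constructed. The balancing rule, the diagonal fallback for short prefixes, and the function names (balanced_distance_bands, banded_head_masks) are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch, assuming a causal decoder setting; band boundaries, the
# diagonal fallback, and all names here are illustrative, not the paper's code.
import torch

def balanced_distance_bands(seq_len: int, num_heads: int) -> list[tuple[int, int]]:
    """Split query-key distances [0, seq_len) into num_heads contiguous,
    non-overlapping bands holding roughly equal numbers of (query, key) pairs."""
    counts = torch.arange(seq_len, 0, -1, dtype=torch.float)  # pairs at distance d: N - d
    cum = torch.cumsum(counts, dim=0)
    target = cum[-1] / num_heads
    bands, lo = [], 0
    for h in range(1, num_heads):
        hi = int(torch.searchsorted(cum, h * target).item()) + 1
        bands.append((lo, hi))
        lo = hi
    bands.append((lo, seq_len))
    return bands

def banded_head_masks(seq_len: int, num_heads: int) -> torch.Tensor:
    """Boolean mask of shape (num_heads, seq_len, seq_len): head h may attend
    from query i to key j only when the distance i - j falls in band h."""
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)     # negative above the diagonal
    masks = []
    for lo, hi in balanced_distance_bands(seq_len, num_heads):
        band = (dist >= lo) & (dist < hi)          # causal by construction (dist >= 0)
        # Guard: queries earlier than the band's minimum distance would have no
        # valid keys at all, so let those rows fall back to the diagonal.
        empty = ~band.any(dim=-1, keepdim=True)
        masks.append(band | (empty & (dist == 0)))
    return torch.stack(masks)

# Usage: the per-head mask plugs into an ordinary attention call.
mask = banded_head_masks(seq_len=16, num_heads=4)  # (4, 16, 16)
q = k = v = torch.randn(1, 4, 16, 32)              # (batch, heads, seq, head_dim)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Each band here is causal and covers a comparable share of the N(N+1)/2 query-key pairs, one plausible reading of the "balanced" workload split described in the abstract. Note that masking alone only illustrates the connectivity pattern; realizing the factor-of-H reduction would require a kernel that computes only each head's band rather than masking a dense score matrix.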

Published

2026-03-14

How to Cite

Zhao, M., Hu, W., Wang, J., Lai, X., Huang, T., Min, Y., … Zhu, X. (2026). Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34959–34967. https://doi.org/10.1609/aaai.v40i41.40800

Section

AAAI Technical Track on Natural Language Processing VI