Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Authors

  • Ziran Qin, Shanghai Jiao Tong University
  • Youru Lv, Shanghai Jiao Tong University
  • Mingbao Lin, Rakuten Asia Pte. Ltd.
  • Hang Guo, Tsinghua University
  • Zeren Zhang, Peking University
  • Danping Zou, Shanghai Jiao Tong University
  • Weiyao Lin, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i30.39686

Abstract

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality content generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale generation paradigm. We begin with a crucial observation: attention heads in VAR models can be divided into two functionally distinct categories. Contextual Heads focus on maintaining semantic consistency, while Structural Heads are responsible for preserving spatial coherence. This functional divergence causes existing one-size-fits-all compression methods to perform poorly on VAR models. To address this, we propose HACK, a training-free Head-Aware KV cache Compression frameworK. HACK utilizes an offline classification scheme to separate head types, enabling it to apply pattern-specific compression strategies with asymmetric cache budgets for each category. By doing so, HACK effectively constrains the average KV cache length within a fixed budget B, reducing the theoretical attention complexity from O(n⁴) to O(Bn²). Extensive experiments on multiple VAR models across text-to-image and class-conditional tasks validate the effectiveness and generalizability of HACK. It achieves up to 70% KV cache compression without degrading output quality, resulting in memory savings and faster inference. For example, HACK provides a 1.75× memory reduction and a 1.57× speedup on Infinity-8B.
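The core idea in the abstract, classifying heads offline and then evicting KV entries under per-category budgets, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compress_kv`, the use of accumulated attention scores as the eviction criterion, and the specific budget values are all assumptions for demonstration.

```python
import numpy as np

def compress_kv(keys, values, scores, head_types, budget_ctx, budget_struct):
    """Keep the top-scoring KV entries per head, with asymmetric budgets.

    keys, values: (num_heads, seq_len, dim) cached tensors
    scores:       (num_heads, seq_len) per-token importance (assumed here to
                  be accumulated attention mass; the paper's criterion may differ)
    head_types:   list of "contextual" or "structural", one per head
                  (produced by an offline classification step)
    """
    compressed = []
    for h, kind in enumerate(head_types):
        # Asymmetric budgets: each head category gets its own cache size.
        budget = budget_ctx if kind == "contextual" else budget_struct
        keep = np.argsort(scores[h])[-budget:]  # indices of highest importance
        keep.sort()                             # preserve original token order
        compressed.append((keys[h][keep], values[h][keep]))
    return compressed

# Toy usage: 2 heads, 10 cached tokens, head-dim 4.
rng = np.random.default_rng(0)
H, L, D = 2, 10, 4
keys = rng.standard_normal((H, L, D))
values = rng.standard_normal((H, L, D))
scores = rng.random((H, L))
out = compress_kv(keys, values, scores,
                  ["contextual", "structural"],
                  budget_ctx=6, budget_struct=3)
```

With these (hypothetical) budgets, the contextual head retains 6 of 10 entries and the structural head only 3, so the average retained length, (6 + 3) / 2 = 4.5, stays under a fixed budget B regardless of how long the cross-scale sequence grows; this bounded average cache length is what turns the O(n⁴) attention cost into O(Bn²).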

Published

2026-03-14

How to Cite

Qin, Z., Lv, Y., Lin, M., Guo, H., Zhang, Z., Zou, D., & Lin, W. (2026). Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 24982-24990. https://doi.org/10.1609/aaai.v40i30.39686

Section

AAAI Technical Track on Machine Learning VII