KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

Authors

  • Fei Li, School of Computer Science and Technology, Xi’an Jiaotong University
  • Song Liu, School of Computer Science and Technology, Xi’an Jiaotong University
  • Weiguo Wu, School of Computer Science and Technology, Xi’an Jiaotong University
  • Shiqiang Nie, School of Computer Science and Technology, Xi’an Jiaotong University
  • Jinyu Wang, School of Computer Science and Technology, Xi’an Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i37.40422

Abstract

The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static, one-size-fits-all precision allocation or fail to dynamically prioritize critical KV pairs in long-context tasks, forcing tradeoffs among memory, accuracy, and throughput. In this work, we propose a novel mixed-precision quantization method for the KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically assigns higher precision to important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to reduce computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with an extremely low-bit quantization configuration (Key 2.19-bit, Value 2.38-bit), while delivering a 4.9× memory compression and a 5.3× speedup in inference throughput.
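
As a rough illustration of how layer-specific bit-width allocation can produce a fractional average bit-width, the sketch below assigns a higher bit-width to the most important layers and a lower one to the rest, then applies a simple asymmetric min-max quantizer to one group of values. It is a minimal sketch under stated assumptions, not the authors' implementation: the importance scores are made-up placeholders (the abstract derives them from gradients of the model loss), and the function names, the 3-bit/2-bit split, the choice of 6 out of 32 layers, and the quantizer itself are all hypothetical choices that the abstract does not specify.

```python
from typing import List, Sequence, Tuple


def allocate_bits(importance: Sequence[float], num_high: int,
                  high_bits: int = 3, low_bits: int = 2) -> List[int]:
    """Give `high_bits` to the `num_high` most important layers, `low_bits` to the rest."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    high = set(ranked[:num_high])
    return [high_bits if i in high else low_bits
            for i in range(len(importance))]


def quantize(values: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric min-max uniform quantization of one group of values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1                 # number of steps, e.g. 3 for 2 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes: Sequence[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


# Placeholder importance scores for 32 layers (assumed, not measured).
importance = [1.0 / (i + 1) for i in range(32)]
bits = allocate_bits(importance, num_high=6)
avg_bits = sum(bits) / len(bits)             # (6*3 + 26*2) / 32

# Quantize one small group of made-up cache values at 2 bits.
group = [-1.0, -0.25, 0.4, 2.0]
codes, scale, lo = quantize(group, bits=2)
restored = dequantize(codes, scale, lo)
```

With these assumed settings the average works out to 70/32 = 2.1875 bits per layer, which shows how a fractional figure of this kind can arise from mixing integer bit-widths; whether this matches the configuration used in the paper is not stated in the abstract. The quantizer's reconstruction error per value is bounded by half of `scale`.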

Published

2026-03-14

How to Cite

Li, F., Liu, S., Wu, W., Nie, S., & Wang, J. (2026). KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31563–31572. https://doi.org/10.1609/aaai.v40i37.40422

Issue

Section

AAAI Technical Track on Natural Language Processing II