DesireKV: Decoupling Sensitivity and Importance for Reasoning-Aware KV Cache Compression

Authors

  • Pengyu Cheng, Xi'an Jiaotong University
  • Jiacheng Wang, Xi'an Jiaotong University
  • Tianle Chen, Xi'an Jiaotong University
  • Bei Liu, Hong Kong University of Science and Technology
  • Xiaofeng Hou, Shanghai Jiao Tong University
  • Jiacheng Liu, Hong Kong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i25.39187

Abstract

Large language models performing chain-of-thought (CoT) reasoning generate long intermediate sequences whose key-value (KV) cache consumes substantial memory. Unlike conventional text generation, reasoning sequences exhibit distinctive characteristics, including repetitive logical patterns and low information density, that make existing KV cache compression methods suboptimal. We propose DesireKV, a compression framework that places each cached token in a two-dimensional coordinate system defined by attention-derived importance and outlier-based quantization sensitivity, and adds a dedicated protection mechanism for tokens critical to the reasoning process itself. Along these two axes, DesireKV makes differentiated compression decisions: it retains tokens that are both important and sensitive, quantizes important but insensitive tokens, and evicts unimportant tokens. Comprehensive evaluation on reasoning benchmarks shows that DesireKV achieves up to 2.93× higher throughput while preserving nearly 99% of the original reasoning accuracy.
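
The retain/quantize/evict decision lends itself to a short illustration. Below is a minimal Python sketch of the three-way rule, not the paper's implementation: the scoring proxies (accumulated attention for importance, key-channel outlier ratios for sensitivity), the fixed thresholds, and the function name classify_kv_tokens are all assumptions introduced here for illustration.

```python
import torch

def classify_kv_tokens(attn_scores: torch.Tensor,
                       keys: torch.Tensor,
                       importance_thresh: float = 0.5,
                       sensitivity_thresh: float = 0.5):
    """Hypothetical sketch of a two-axis retain/quantize/evict rule.

    attn_scores: (seq_len,) accumulated attention each token receives,
                 used here as a proxy for importance.
    keys:        (seq_len, head_dim) cached key vectors; channel outliers
                 stand in for quantization sensitivity.
    Thresholds are placeholders, not values from the paper.
    """
    # Importance axis: normalize accumulated attention into [0, 1].
    importance = attn_scores / attn_scores.max().clamp(min=1e-8)

    # Sensitivity axis: tokens whose key vectors carry large outlier
    # channels tend to degrade most under low-bit quantization.
    outlier_ratio = (keys.abs().max(dim=-1).values
                     / keys.abs().mean(dim=-1).clamp(min=1e-8))
    sensitivity = outlier_ratio / outlier_ratio.max().clamp(min=1e-8)

    important = importance >= importance_thresh
    sensitive = sensitivity >= sensitivity_thresh

    retain = important & sensitive      # keep uncompressed
    quantize = important & ~sensitive   # compress to low-bit storage
    evict = ~important                  # drop from the KV cache
    return retain, quantize, evict

if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(16)        # fake accumulated attention
    keys = torch.randn(16, 64)     # fake cached key vectors
    retain, quantize, evict = classify_kv_tokens(scores, keys)
    print(retain.sum().item(), quantize.sum().item(), evict.sum().item())
```

Scoring the two axes independently mirrors the paper's central idea: sensitivity and importance are decoupled, rather than collapsed into a single eviction score.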

Published

2026-03-14

How to Cite

Cheng, P., Wang, J., Chen, T., Liu, B., Hou, X., & Liu, J. (2026). DesireKV: Decoupling Sensitivity and Importance for Reasoning-Aware KV Cache Compression. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20518-20526. https://doi.org/10.1609/aaai.v40i25.39187

Issue

Vol. 40 No. 25 (2026)

Section

AAAI Technical Track on Machine Learning II