DesireKV: Decoupling Sensitivity and Importance for Reasoning-Aware KV Cache Compression
DOI:
https://doi.org/10.1609/aaai.v40i25.39187
Abstract
Large language models performing chain-of-thought (CoT) reasoning generate extensive intermediate sequences that consume substantial memory through key-value (KV) cache storage. Unlike conventional text generation, reasoning sequences exhibit unique characteristics, including repetitive logic patterns and low information density, making existing KV cache compression methods suboptimal. We propose DesireKV, a novel compression framework that first constructs a two-dimensional coordinate system based on attention-derived importance and outlier-based quantization sensitivity. It then applies a dedicated protection mechanism for tokens critical to the reasoning process itself. Our approach makes differentiated compression decisions: retaining important and sensitive tokens, quantizing important but insensitive tokens, and evicting unimportant tokens. Through comprehensive evaluation on reasoning benchmarks, we demonstrate that DesireKV achieves up to 2.93× throughput improvement while maintaining nearly 99% of original reasoning accuracy.
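The abstract's three-way decision rule (retain, quantize, or evict each cached token based on its position in the importance–sensitivity plane) can be sketched as follows. This is a minimal illustration based only on the abstract: the function name, threshold values, and score inputs are assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a DesireKV-style three-way decision over the
# importance/sensitivity coordinate system described in the abstract.
# Thresholds and score definitions are illustrative assumptions.
def kv_cache_decisions(importance, sensitivity, imp_thresh=0.5, sens_thresh=0.5):
    """Map each token to 'retain', 'quantize', or 'evict'.

    importance  -- per-token attention-derived importance score
    sensitivity -- per-token outlier-based quantization sensitivity
    """
    decisions = []
    for imp, sens in zip(importance, sensitivity):
        if imp < imp_thresh:
            decisions.append("evict")      # unimportant: drop from the KV cache
        elif sens >= sens_thresh:
            decisions.append("retain")     # important and sensitive: keep full precision
        else:
            decisions.append("quantize")   # important but insensitive: store low-bit
    return decisions

print(kv_cache_decisions([0.9, 0.8, 0.1], [0.9, 0.2, 0.7]))
# → ['retain', 'quantize', 'evict']
```

The paper additionally protects tokens critical to the reasoning process itself; such a protection mechanism is not modeled in this sketch.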
Published
2026-03-14
How to Cite
Cheng, P., Wang, J., Chen, T., Liu, B., Hou, X., & Liu, J. (2026). DesireKV: Decoupling Sensitivity and Importance for Reasoning-Aware KV Cache Compression. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20518-20526. https://doi.org/10.1609/aaai.v40i25.39187
Section
AAAI Technical Track on Machine Learning II