CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation

Authors

  • Hongxuan Zhang (Nanjing University; Ant Group)
  • Yao Zhao (Ant Group)
  • Jiaqi Zheng (Nanjing University)
  • Chenyi Zhuang (Ant Group)
  • Jinjie Gu (Ant Group)
  • Guihai Chen (Nanjing University)

DOI

https://doi.org/10.1609/aaai.v39i24.34779

Abstract

The emergence of long-context text applications utilizing large language models (LLMs) has presented significant scalability challenges, particularly in memory footprint. The Key-Value (KV) cache, which stores attention keys and values to reduce redundant computation, grows linearly with context length; this growth can dominate memory usage and may prevent models from functioning properly in memory-constrained environments. To address this issue, we propose a novel approach called Cache Sparse Representation (CSR), which transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural-network-based method to automatically generate the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR matches the performance of state-of-the-art KV cache quantization algorithms while ensuring robust functionality in memory-constrained environments.
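The abstract does not spell out CSR's algorithm, but the core idea it describes, replacing a dense cache vector with a few dictionary indices and weights, can be illustrated with a generic matching-pursuit sketch. This is an assumption-laden toy, not the paper's method: the random unit-norm dictionary here stands in for the one the paper learns with NeuralDict, and the function names and sparsity level `s` are invented for illustration.

```python
import numpy as np

def sparse_represent(x, dictionary, s):
    """Greedily pick s unit-norm dictionary atoms (matching-pursuit style)
    and keep only their indices and weights.

    Storing s (index, weight) pairs instead of the full d-dimensional
    vector is what makes the representation memory-efficient.
    """
    residual = x.astype(np.float64).copy()
    indices, weights = [], []
    for _ in range(s):
        scores = dictionary @ residual           # correlation with each atom
        idx = int(np.argmax(np.abs(scores)))     # best-matching atom
        w = float(scores[idx])                   # projection (atoms unit-norm)
        residual -= w * dictionary[idx]          # peel it off and repeat
        indices.append(idx)
        weights.append(w)
    return np.array(indices), np.array(weights)

def reconstruct(indices, weights, dictionary):
    """Approximate the original vector from its sparse representation."""
    return weights @ dictionary[indices]

# Toy usage: a random dictionary stands in for a learned one.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 16))                # 64 atoms, dimension 16
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm rows
key = rng.standard_normal(16)                    # one cached key vector
idx, w = sparse_represent(key, D, s=4)
approx = reconstruct(idx, w, D)                  # lossy reconstruction
```

In this sketch a 16-float vector is replaced by 4 small indices and 4 weights; CSR's reported sub-1-bit-per-element budgets would additionally require a learned dictionary and aggressive index/weight encoding, which the abstract attributes to NeuralDict.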

Published

2025-04-11

How to Cite

Zhang, H., Zhao, Y., Zheng, J., Zhuang, C., Gu, J., & Chen, G. (2025). CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25860–25867. https://doi.org/10.1609/aaai.v39i24.34779

Section

AAAI Technical Track on Natural Language Processing III