Efficient Multimodal Large Language Model via Dynamic KV Cache Quantization

Authors

  • Jiahao Fan, University of Sydney
  • Chien-Ming Chen, Nanjing University of Information Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i25.39241

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across diverse vision-language tasks, including image captioning, visual question answering, and text-image retrieval. However, their computational complexity and memory footprint, particularly in the key-value (KV) cache during inference, pose significant challenges for real-time deployment, especially on resource-constrained devices. In this paper, we propose Dynamic KV Cache Quantization, a novel quantization strategy tailored for MLLMs. Our approach applies per-channel quantization to keys (K) and per-token quantization to values (V), leveraging their respective statistical distributions to optimize precision allocation. Additionally, we introduce an adaptive token and channel recording mechanism that dynamically adjusts quantization parameters based on real-time distribution tracking, effectively mitigating the impact of outliers. To further enhance compression efficiency, we implement fine-grained grouping, which partitions KV tensors into localized subgroups so that quantization parameters can adapt to local statistics. Experimental results on LLaVA-1.5 (7B/13B) and Qwen-VL across multiple multimodal benchmarks demonstrate that our method significantly outperforms existing KV-cache quantization approaches, achieving a superior trade-off between memory efficiency and model accuracy.
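For readers unfamiliar with the per-channel vs. per-token distinction, the sketch below illustrates the general idea in plain NumPy: keys are quantized with scales computed along the token axis (one scale set per channel group), while values are quantized along the channel axis (one scale set per token group), with fine-grained groups of a fixed size. This is only a minimal illustration of group-wise asymmetric quantization; the function name, group size, bit-width, and tensor shapes are illustrative assumptions and do not reproduce the paper's adaptive recording mechanism or dynamic parameter updates.

```python
import numpy as np

def quantize_groups(x, axis, group_size=32, n_bits=4):
    """Illustrative group-wise asymmetric uniform quantization (not the paper's exact method).

    Elements along `axis` are split into groups of `group_size`, each with its own
    scale and zero point. Returns the de-quantized tensor so the reconstruction
    error can be inspected directly."""
    x = np.asarray(x, dtype=np.float32)
    moved = np.moveaxis(x, axis, -1)          # put the quantization axis last
    shape = moved.shape
    assert shape[-1] % group_size == 0, "axis length must be divisible by group_size"
    g = moved.reshape(*shape[:-1], -1, group_size)

    qmax = 2 ** n_bits - 1
    lo = g.min(axis=-1, keepdims=True)        # per-group zero point
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax  # per-group scale
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    deq = q * scale + lo                      # de-quantize for error inspection

    return np.moveaxis(deq.reshape(shape), -1, axis)

# Toy KV cache for one attention head: (num_tokens, head_dim); values are synthetic.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
V = rng.normal(size=(128, 64)).astype(np.float32)

# Per-channel quantization of K: groups run along the token axis (axis 0),
# so each channel keeps its own scales/zero points.
K_hat = quantize_groups(K, axis=0)

# Per-token quantization of V: groups run along the channel axis (axis 1),
# so each token keeps its own scales/zero points.
V_hat = quantize_groups(V, axis=1)

print("mean |K - K_hat|:", np.abs(K - K_hat).mean())
print("mean |V - V_hat|:", np.abs(V - V_hat).mean())
```

The choice of axis matters because key channels and value tokens exhibit different outlier patterns; aligning the quantization groups with the more homogeneous dimension keeps per-group ranges tight and reduces quantization error.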

Published

2026-03-14

How to Cite

Fan, J., & Chen, C.-M. (2026). Efficient Multimodal Large Language Model via Dynamic KV Cache Quantization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20994–21001. https://doi.org/10.1609/aaai.v40i25.39241

Section

AAAI Technical Track on Machine Learning II