Efficient Multimodal Large Language Model via Dynamic KV Cache Quantization

Authors

  • Jiahao Fan, University of Sydney
  • Chien-Ming Chen, Nanjing University of Information Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i25.39241

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across diverse vision-language tasks, including image captioning, visual question answering, and text-image retrieval. However, their computational complexity and memory footprint, particularly in the key-value (KV) cache during inference, pose significant challenges for real-time deployment, especially on resource-constrained devices. In this paper, we propose Dynamic KV Cache Quantization, a novel quantization strategy tailored for MLLMs. Our approach applies per-channel quantization to keys (K) and per-token quantization to values (V), leveraging their respective statistical distributions to optimize precision allocation. Additionally, we introduce an adaptive token and channel recording mechanism that dynamically adjusts quantization parameters based on real-time distribution tracking, effectively mitigating the impact of outliers. To further enhance compression efficiency, we implement fine-grained grouping, which partitions KV tensors into localized subgroups so that quantization parameters can adapt to local statistics. Experimental results on LLaVA-1.5 (7B/13B) and Qwen-VL across multiple multimodal benchmarks demonstrate that our method significantly outperforms existing KV-cache quantization approaches, achieving a superior trade-off between memory efficiency and model accuracy.
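For readers unfamiliar with the per-channel vs. per-token distinction, the sketch below illustrates the general idea in plain NumPy: keys are quantized with scales computed along the token axis (one scale set per channel group), while values are quantized along the channel axis (one scale set per token group), with fine-grained groups of a fixed size. This is only a minimal illustration of group-wise asymmetric quantization; the function name, group size, bit-width, and tensor shapes are illustrative assumptions and do not reproduce the paper's adaptive recording mechanism or dynamic parameter updates.

```python
import numpy as np

def quantize_groups(x, axis, group_size=32, n_bits=4):
    """Illustrative group-wise asymmetric uniform quantization (not the paper's exact method).

    Elements along `axis` are split into groups of `group_size`, each with its own
    scale and zero point. Returns the de-quantized tensor so the reconstruction
    error can be inspected directly."""
    x = np.asarray(x, dtype=np.float32)
    moved = np.moveaxis(x, axis, -1)          # put the quantization axis last
    shape = moved.shape
    assert shape[-1] % group_size == 0, "axis length must be divisible by group_size"
    g = moved.reshape(*shape[:-1], -1, group_size)

    qmax = 2 ** n_bits - 1
    lo = g.min(axis=-1, keepdims=True)        # per-group zero point
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax  # per-group scale
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    deq = q * scale + lo                      # de-quantize for error inspection

    return np.moveaxis(deq.reshape(shape), -1, axis)

# Toy KV cache for one attention head: (num_tokens, head_dim); values are synthetic.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
V = rng.normal(size=(128, 64)).astype(np.float32)

# Per-channel quantization of K: groups run along the token axis (axis 0),
# so each channel keeps its own scales/zero points.
K_hat = quantize_groups(K, axis=0)

# Per-token quantization of V: groups run along the channel axis (axis 1),
# so each token keeps its own scales/zero points.
V_hat = quantize_groups(V, axis=1)

print("mean |K - K_hat|:", np.abs(K - K_hat).mean())
print("mean |V - V_hat|:", np.abs(V - V_hat).mean())
```

The choice of axis matters because key channels and value tokens exhibit different outlier patterns; aligning the quantization groups with the more homogeneous dimension keeps per-group ranges tight and reduces quantization error.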

Published

2026-03-14

How to Cite

Fan, J., & Chen, C.-M. (2026). Efficient Multimodal Large Language Model via Dynamic KV Cache Quantization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20994–21001. https://doi.org/10.1609/aaai.v40i25.39241

Section

AAAI Technical Track on Machine Learning II