FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching

Hongyaoxing Gu; Lijuan Hu; Shuzi Niu; Fangfang Liu

doi:10.1609/aaai.v40i26.39283

Authors

Hongyaoxing Gu: Institute of Software, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Lijuan Hu: Institute of Software, Chinese Academy of Sciences
Shuzi Niu: Institute of Software, Chinese Academy of Sciences
Fangfang Liu: Institute of Software, Chinese Academy of Sciences; Key Laboratory of System Software (Chinese Academy of Sciences)


Abstract

Traditional post-training quantization (PTQ) is considered an effective approach to reducing the model size and accelerating the inference of large language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to settle on a single compromise rank across diverse data and layers, so they fail to exploit the full potential of low-rank approximation in large models. In addition, the SVD-based low-rank approximation used by current methods further increases the computational overhead. In this work, we analyze in detail how the effectiveness of low-rank approximation varies across layers in representative models. Based on this analysis, we introduce Flexible Low-Rank Quantization (FLRQ), a method designed to quickly identify accuracy-optimal ranks and aggregate them into minimal-storage combinations. FLRQ has two components: Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for fast low-rank approximation, enabling outlier-aware rank extraction for each layer. BLC iteratively minimizes the low-rank quantization error under the scaling and clipping strategy. In comprehensive experiments, FLRQ proves effective and robust, achieving state-of-the-art performance in both quantization quality and algorithmic efficiency.
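The abstract only names R1-FLR and BLC, so the following is a minimal sketch of our own, assuming NumPy, of two general ideas it relies on: estimating rank-1 components from a Gaussian random projection refined by a few power iterations, and letting a per-layer error tolerance decide the rank instead of fixing one compromise rank. The names `rank1_sketch`, `flexible_rank` and `quantize_rtn`, the tolerance `tol`, the iteration count, and the symmetric round-to-nearest quantizer are all hypothetical. This is not the paper's R1-Sketch, and it does not model BLC's iterative scaling and clipping search.

```python
import numpy as np


def rank1_sketch(R, rng, n_iter=4):
    """Estimate one dominant rank-1 term s * u v^T of R.

    Assumed scheme: start from a Gaussian random vector and refine it with a
    few power iterations (a generic randomized sketch, not the paper's R1-Sketch).
    """
    v = rng.standard_normal(R.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        v /= np.linalg.norm(v)
    u = R @ v
    s = np.linalg.norm(u)
    return u / s, s, v


def flexible_rank(R, tol=0.05, max_rank=None, n_iter=4, seed=0):
    """Peel off rank-1 sketches until ||residual||_F <= tol * ||R||_F.

    The rank is whatever this (hypothetical) stopping rule yields, so it can
    differ per layer. Returns U (m x r), s (r,), V (n x r), R ~= U diag(s) V^T.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    max_rank = min(m, n) if max_rank is None else max_rank
    target = tol * np.linalg.norm(R)
    residual = R.astype(float)  # astype returns a copy, so R is untouched
    us, ss, vs = [], [], []
    while len(ss) < max_rank and np.linalg.norm(residual) > target:
        u, s, v = rank1_sketch(residual, rng, n_iter)
        residual = residual - s * np.outer(u, v)  # deflate the extracted term
        us.append(u)
        ss.append(s)
        vs.append(v)
    U = np.stack(us, axis=1) if us else np.zeros((m, 0))
    V = np.stack(vs, axis=1) if vs else np.zeros((n, 0))
    return U, np.array(ss), V


def quantize_rtn(W, bits=4, clip=1.0):
    """Symmetric per-row round-to-nearest quantization with a clipping ratio."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip * np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing all-zero rows by 0
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


# Synthetic "layer": a dominant rank-3 part plus a small full-rank remainder.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 64))
W += 0.01 * rng.standard_normal((64, 64))

U, s, V = flexible_rank(W, tol=0.05)
L = (U * s) @ V.T  # low-rank branch, kept in floating point
err_plain = np.linalg.norm(W - quantize_rtn(W))
err_lowrank = np.linalg.norm(W - (L + quantize_rtn(W - L)))
print("selected rank:", len(s), "plain:", err_plain, "low-rank:", err_lowrank)
```

Keeping the extracted low-rank branch in floating point and quantizing only the remainder shrinks the per-row range the quantizer has to cover, which is why `err_lowrank` is smaller than `err_plain` on this synthetic matrix. On real weights, the rank selected by the tolerance would vary from layer to layer, which is the behavior the abstract's flexible rank selection targets.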
