MemeBQ: Memory Efficient Binary Quantization of LLMs
DOI: https://doi.org/10.1609/aaai.v40i31.39881

Abstract
Recent years have witnessed growing scholarly interest in binary post-training quantization (PTQ) techniques for large language models (LLMs). While state-of-the-art (SOTA) binary quantization methods significantly reduce memory footprint and computational demands, they introduce additional memory overhead beyond the binary weight tensors to mitigate performance degradation, and binary LLMs still suffer from substantial accuracy loss. To address these limitations, we propose MemeBQ, a novel binary PTQ framework for LLMs that reduces the memory overhead of the auxiliary flag bitmaps used by existing binary quantization methods. Specifically, we first design a greedy row clustering method that leverages the similarity between the row vectors of a weight matrix to partition the rows into groups. By sharing a common flag bitmap within each row group, we significantly reduce the memory overhead associated with flag bitmaps. In addition, to improve the performance of binary LLMs, we propose a novel weight splitting method applied to each row group, which determines the flag bitmap's values in a fine-grained way. Extensive experiments on OPT, Llama-2, and Llama-3 models demonstrate that MemeBQ reduces the extra memory demand by 50% while achieving accuracy comparable to current SOTA methods. Alternatively, with the same number of extra bits, MemeBQ outperforms SOTA binary quantization methods by up to 7% on reasoning benchmarks.
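The abstract does not give the clustering algorithm in detail, but the following minimal sketch illustrates the general idea of greedy row clustering with a shared per-group flag bitmap. All names, the sign-pattern similarity measure, the similarity threshold, and the majority-vote bitmap are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only (hypothetical names and heuristics, not MemeBQ's code):
# greedily group weight rows whose binary sign patterns are similar, then store
# one shared flag bitmap per group instead of one bitmap per row.
import torch

def greedy_row_clustering(W: torch.Tensor, threshold: float = 0.8):
    """Greedily cluster rows of W by sign-pattern similarity (assumed metric).

    Returns a list of groups (row-index lists) and one shared flag bitmap per
    group, here taken as the majority sign pattern within the group.
    """
    signs = (W >= 0)                      # binary sign pattern of each row
    unassigned = list(range(W.shape[0]))
    groups, bitmaps = [], []

    while unassigned:
        seed = unassigned.pop(0)          # greedily seed a new group
        group = [seed]
        # similarity = fraction of sign bits matching the seed row
        for r in list(unassigned):
            sim = (signs[r] == signs[seed]).float().mean().item()
            if sim >= threshold:
                group.append(r)
                unassigned.remove(r)
        # shared flag bitmap: majority vote of sign bits within the group
        bitmap = (signs[group].float().mean(dim=0) >= 0.5)
        groups.append(group)
        bitmaps.append(bitmap)
    return groups, bitmaps

# Usage: cluster one weight matrix and report how many bitmap bits are stored.
W = torch.randn(128, 512)
groups, _ = greedy_row_clustering(W, threshold=0.7)
print(f"{len(groups)} groups for 128 rows -> "
      f"{len(groups) * W.shape[1]} shared bitmap bits instead of {W.numel()}")
```

Under this sketch, the memory saved scales with how many rows fall into each group, which is consistent with the abstract's claim that sharing bitmaps across similar rows reduces the auxiliary overhead.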
Published
2026-03-14
How to Cite
Wang, Y., Liu, K., Pei, M., Li, Z., Wang, P., & Hu, Q. (2026). MemeBQ: Memory Efficient Binary Quantization of LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26715–26722. https://doi.org/10.1609/aaai.v40i31.39881
Issue
Section
AAAI Technical Track on Machine Learning VIII