MemeBQ: Memory Efficient Binary Quantization of LLMs
DOI: https://doi.org/10.1609/aaai.v40i31.39881

Abstract
Recent years have witnessed growing scholarly interest in binary post-training quantization (PTQ) techniques for large language models (LLMs). While state-of-the-art (SOTA) binary quantization methods significantly reduce memory footprint and computational demands, they introduce additional memory overhead beyond the binary weight tensors to mitigate performance degradation, and binary LLMs still suffer from substantial accuracy loss. To address these limitations, we propose MemeBQ, a novel binary PTQ framework for LLMs that reduces the memory overhead of the auxiliary flag bitmaps used by existing binary quantization methods. Specifically, we first design a greedy row clustering method that leverages the similarity between the row vectors of a weight matrix to partition the rows into groups. By sharing a common flag bitmap within each row group, we significantly reduce the memory overhead associated with flag bitmaps. In addition, to improve the performance of binary LLMs, we propose a novel weight splitting method applied to each row group, which determines the flag bitmap's values in a fine-grained way. Extensive experiments on OPT, Llama-2, and Llama-3 models demonstrate that MemeBQ reduces the extra memory demand by 50% while achieving accuracy comparable to current SOTA methods. Alternatively, with the same number of extra bits, MemeBQ outperforms SOTA binary quantization methods by up to 7% on reasoning benchmarks.
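The abstract does not give the clustering algorithm in detail, but the following minimal sketch illustrates the general idea of greedy row clustering with a shared per-group flag bitmap. All names, the sign-pattern similarity measure, the similarity threshold, and the majority-vote bitmap are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only (hypothetical names and heuristics, not MemeBQ's code):
# greedily group weight rows whose binary sign patterns are similar, then store
# one shared flag bitmap per group instead of one bitmap per row.
import torch

def greedy_row_clustering(W: torch.Tensor, threshold: float = 0.8):
    """Greedily cluster rows of W by sign-pattern similarity (assumed metric).

    Returns a list of groups (row-index lists) and one shared flag bitmap per
    group, here taken as the majority sign pattern within the group.
    """
    signs = (W >= 0)                      # binary sign pattern of each row
    unassigned = list(range(W.shape[0]))
    groups, bitmaps = [], []

    while unassigned:
        seed = unassigned.pop(0)          # greedily seed a new group
        group = [seed]
        # similarity = fraction of sign bits matching the seed row
        for r in list(unassigned):
            sim = (signs[r] == signs[seed]).float().mean().item()
            if sim >= threshold:
                group.append(r)
                unassigned.remove(r)
        # shared flag bitmap: majority vote of sign bits within the group
        bitmap = (signs[group].float().mean(dim=0) >= 0.5)
        groups.append(group)
        bitmaps.append(bitmap)
    return groups, bitmaps

# Usage: cluster one weight matrix and report how many bitmap bits are stored.
W = torch.randn(128, 512)
groups, _ = greedy_row_clustering(W, threshold=0.7)
print(f"{len(groups)} groups for 128 rows -> "
      f"{len(groups) * W.shape[1]} shared bitmap bits instead of {W.numel()}")
```

Under this sketch, the memory saved scales with how many rows fall into each group, which is consistent with the abstract's claim that sharing bitmaps across similar rows reduces the auxiliary overhead.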
Published
2026-03-14
How to Cite
Wang, Y., Liu, K., Pei, M., Li, Z., Wang, P., & Hu, Q. (2026). MemeBQ: Memory Efficient Binary Quantization of LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26715–26722. https://doi.org/10.1609/aaai.v40i31.39881
Issue
Section
AAAI Technical Track on Machine Learning VIII