MemeBQ: Memory Efficient Binary Quantization of LLMs

Authors

  • Yuanhui Wang, Sanya Nanhai Innovation and Development Base of Harbin Engineering University
  • Kunlong Liu, Sanya Nanhai Innovation and Development Base of Harbin Engineering University
  • Minnan Pei, C2DL, Institute of Automation, Chinese Academy of Sciences
  • Zhangming Li, C2DL, Institute of Automation, Chinese Academy of Sciences
  • Peisong Wang, C2DL, Institute of Automation, Chinese Academy of Sciences; Nanjing Artificial Intelligence Research of IA (AiRiA); University of Chinese Academy of Sciences, Nanjing
  • Qinghao Hu, C2DL, Institute of Automation, Chinese Academy of Sciences; Nanjing Artificial Intelligence Research of IA (AiRiA); University of Chinese Academy of Sciences, Nanjing

DOI:

https://doi.org/10.1609/aaai.v40i31.39881

Abstract

Recent years have witnessed growing scholarly interest in binary post-training quantization (PTQ) techniques for large language models (LLMs). While state-of-the-art (SOTA) binary quantization methods significantly reduce memory footprint and computational demands, they incur additional memory overhead beyond the binary weight tensors to mitigate performance degradation. Moreover, binary LLMs still suffer substantial accuracy loss. To address these limitations, we propose MemeBQ, a novel binary PTQ framework for LLMs that reduces the memory overhead of the auxiliary flag bitmaps used by existing binary quantization methods. Specifically, we first design a greedy row clustering method that leverages the similarity between the row vectors of the weights to partition the rows into groups. By sharing a common flag bitmap within each row group, we significantly reduce the memory overhead associated with the flag bitmaps. In addition, to improve the performance of binary LLMs, we propose a novel weight splitting method for each row group that determines the flag bitmap's values in a fine-grained way. Extensive experiments on OPT, Llama-2, and Llama-3 models demonstrate that MemeBQ reduces the extra memory demand by 50% while achieving accuracy comparable to current SOTA methods. Alternatively, with the same extra-bit budget, MemeBQ outperforms SOTA binary quantization methods by up to 7% on reasoning benchmarks.
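The core idea of the abstract can be illustrated with a minimal sketch: greedily group weight rows by the similarity of their sign patterns, then let each group share a single flag bitmap instead of storing one per row. The similarity threshold, the sign-agreement criterion, and the magnitude-based bitmap rule below are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def greedy_row_clustering(W, threshold=0.75):
    """Greedily partition weight rows into groups by sign-pattern similarity.

    A row joins the first existing group whose representative sign pattern
    agrees with it on at least `threshold` of the columns; otherwise it
    starts a new group. (Hypothetical criterion for illustration only.)
    """
    signs = np.sign(W)
    groups = []  # each entry: (representative sign row, list of row indices)
    for i, s in enumerate(signs):
        for rep, members in groups:
            if np.mean(rep == s) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return groups

def shared_flag_bitmap(W, rows, top_frac=0.25):
    """One flag bitmap per group: flag the columns with the largest mean
    magnitude across the group's rows (an assumed salience rule)."""
    mags = np.abs(W[rows]).mean(axis=0)
    k = max(1, int(top_frac * W.shape[1]))
    flags = np.zeros(W.shape[1], dtype=bool)
    flags[np.argsort(mags)[-k:]] = True
    return flags
```

If G rows share one bitmap instead of each storing its own, the bitmap storage shrinks by roughly a factor of G, which is the source of the "extra memory" savings the abstract reports.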

Published

2026-03-14

How to Cite

Wang, Y., Liu, K., Pei, M., Li, Z., Wang, P., & Hu, Q. (2026). MemeBQ: Memory Efficient Binary Quantization of LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26715–26722. https://doi.org/10.1609/aaai.v40i31.39881

Section

AAAI Technical Track on Machine Learning VIII