PTMQ: Post-training Multi-Bit Quantization of Neural Networks

Authors

  • Ke Xu (Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University; School of Artificial Intelligence, Anhui University)
  • Zhongcheng Li (School of Artificial Intelligence, Anhui University)
  • Shanshan Wang (Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University)
  • Xingyi Zhang (Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University; School of Computer Science and Technology, Anhui University)

DOI:

https://doi.org/10.1609/aaai.v38i14.29553

Keywords:

ML: Learning on the Edge & Model Compression, CV: Learning & Optimization for CV

Abstract

Model quantization with arbitrary bit-widths, which can dynamically meet diverse bit-width requirements at runtime, has attracted significant attention. Recent research has focused on large-scale training methods to achieve robust bit-width adaptation, a time-consuming process that requires hundreds of GPU hours. Furthermore, switching bit-widths requires recalculating the statistical parameters of the normalization layers, which impedes real-time bit-width switching. To overcome these challenges, we propose an efficient Post-Training Multi-bit Quantization (PTMQ) scheme that requires only a small amount of calibration data to perform block-wise reconstruction of multi-bit quantization errors. It eliminates the influence of statistical parameters by fusing the normalization layers, and it supports real-time bit-width switching in both uniform and mixed-precision quantization. To improve quantization accuracy and robustness, we propose a Multi-bit Feature Mixer (MFM) technique that fuses features of different bit-widths to enhance robustness across varying bit-widths. Moreover, we introduce a Group-wise Distillation Loss (GD-Loss) to strengthen the correlation between different bit-width groups and further improve the overall performance of PTMQ. Extensive experiments demonstrate that PTMQ achieves performance comparable to existing state-of-the-art post-training quantization methods while speeding up optimization by 100$\times$ compared to recent multi-bit quantization works. Code is available at https://github.com/xuke225/PTMQ.
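
The abstract notes that PTMQ removes the dependence on per-bit-width batch statistics by fusing normalization layers into the preceding weights. Below is a minimal PyTorch sketch of the standard conv/BatchNorm folding step that achieves this; the `fuse_conv_bn` helper and the ResNet-18 usage are illustrative assumptions and not code taken from the PTMQ repository.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d layer into the preceding Conv2d.

    The BN scale and shift are absorbed into the conv weights and bias,
    so switching quantization bit-widths no longer requires re-estimating
    running mean/variance. (Illustrative sketch, not the PTMQ source.)
    """
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Example: fuse the first conv/BN pair of a torchvision ResNet-18 before quantization.
from torchvision.models import resnet18
model = resnet18(weights=None).eval()
model.conv1 = fuse_conv_bn(model.conv1, model.bn1)
model.bn1 = nn.Identity()
```

Because the normalization statistics are baked into the fused convolution, the same folded weights can then be quantized to any supported bit-width, which is what allows bit-width switching without recomputing norm-layer statistics.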

Published

2024-03-24

How to Cite

Xu, K., Li, Z., Wang, S., & Zhang, X. (2024). PTMQ: Post-training Multi-Bit Quantization of Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 16193-16201. https://doi.org/10.1609/aaai.v38i14.29553

Issue

Vol. 38 No. 14 (2024)

Section

AAAI Technical Track on Machine Learning V