Bi-VLM: Binary Post-Training Quantization for Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i12.37989

Abstract
We address the critical gap between the computational demands of vision-language models (VLMs) and the ultra-low-bit weight precision (bitwidth <= 2 bits) that enables higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which partitions model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier subset and multiple inlier subsets, ensuring that each subset contains the proportion of weights corresponding to its quantile in the distribution. We further propose a saliency-aware hybrid quantization algorithm that quantizes weights by imposing different constraints on the scaling and binary matrices, based on a saliency metric and the compression objective. We evaluate our approach on different VLMs. For the language-model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%.

Published
2026-03-14
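The quantile-based grouping described in the abstract can be illustrated with a minimal sketch: standardize the weights, cut the distribution at standard-normal quantile thresholds, and collect the resulting subsets, with the two tails serving as outlier groups. This is only an illustration of the general idea under a roughly-Gaussian assumption; the quantile choices and the function name `quantile_partition` are hypothetical, not the paper's exact configuration.

```python
import numpy as np

# Standard-normal quantile values (z-scores) for the cut points.
# These particular quantiles (5%, 50%, 95%) are illustrative only.
Z_CUTS = np.array([-1.6449, 0.0, 1.6449])

def quantile_partition(weights):
    """Split a weight vector into subsets by Gaussian quantiles.

    Each subset receives roughly the proportion of weights implied by
    its quantile interval; the outer two subsets hold the tail
    ("outlier") weights and the inner subsets the "inlier" weights.
    """
    mu, sigma = weights.mean(), weights.std()
    # Map the z-score cuts into weight space for this layer's statistics.
    cuts = mu + sigma * Z_CUTS
    # digitize assigns each weight to the bin between consecutive cuts.
    bins = np.digitize(weights, cuts)
    return [weights[bins == b] for b in range(len(cuts) + 1)]

# Example: a synthetic, roughly Gaussian weight vector.
w = np.random.default_rng(0).normal(size=10_000)
subsets = quantile_partition(w)
sizes = [len(s) for s in subsets]
# Expect roughly 5% / 45% / 45% / 5% of the weights per subset.
```

In the actual method, each subset would then be binarized separately, so the tail groups can be treated with different constraints than the bulk of the distribution.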
How to Cite
Wang, X., Abdalla, R., Huang, J., Zhang, C., Xian, R., & Manocha, D. (2026). Bi-VLM: Binary Post-Training Quantization for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10207-10215. https://doi.org/10.1609/aaai.v40i12.37989
Section
AAAI Technical Track on Computer Vision IX