Bi-VLM: Binary Post-Training Quantization for Vision-Language Models

Authors

  • Xijun Wang, University of Maryland, College Park
  • Rayyan Abdalla, University of Maryland, College Park
  • Junyun Huang, University of Maryland, College Park
  • Chengyuan Zhang, University of Maryland, College Park
  • Ruiqi Xian, University of Maryland, College Park
  • Dinesh Manocha, University of Maryland, College Park

DOI:

https://doi.org/10.1609/aaai.v40i12.37989

Abstract

We address the critical gap between the computational demands of vision-language models (VLMs) and the ultra-low-bit weight precision (bitwidth <= 2 bits) that can be used for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which partitions model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier subset and multiple inlier subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile of the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and the compression objective. We evaluate our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%.
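
To make the quantile-based weight grouping concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): it fits a Gaussian to a layer's weight tensor, assigns each weight to an inlier or outlier group using two-sided Gaussian quantile thresholds, and binarizes each group with its own scaling factor. The specific quantile probabilities, group count, and mean-|w| scaling rule are illustrative assumptions, and the paper's saliency-aware constraints are omitted.

```python
# Illustrative sketch of quantile-based weight grouping + per-group binarization.
# Assumptions: the quantile probabilities, group count, and mean-|w| scaling
# rule are placeholders; this is not the paper's Bi-VLM algorithm.
import torch


def gaussian_quantile_groups(w: torch.Tensor, probs=(0.5, 0.75, 0.9, 0.99)):
    """Assign each weight to a group by |z-score|, with thresholds taken from
    two-sided quantiles of a Gaussian fit to the weight tensor. The last group
    (beyond the 0.99 quantile here) plays the role of the outlier subset."""
    mu, sigma = w.mean(), w.std()
    std_normal = torch.distributions.Normal(0.0, 1.0)
    # Two-sided threshold: P(|Z| <= t) = p  =>  t = icdf((1 + p) / 2).
    cuts = torch.stack([std_normal.icdf(torch.tensor((1.0 + p) / 2.0)) for p in probs])
    z = (w - mu).abs() / sigma
    return torch.bucketize(z, cuts)  # group index 0..len(probs) per weight


def binarize_by_group(w: torch.Tensor, group_id: torch.Tensor) -> torch.Tensor:
    """Binarize each group separately: w_hat = alpha_g * sign(w), where alpha_g
    is the mean |w| within the group (a common binary-quantization choice)."""
    w_hat = torch.zeros_like(w)
    for g in range(int(group_id.max()) + 1):
        mask = group_id == g
        if mask.any():
            alpha = w[mask].abs().mean()  # per-group scaling factor
            w_hat[mask] = alpha * torch.sign(w[mask])
    return w_hat


# Usage: quantize one linear layer's weight matrix and check the error.
weight = torch.randn(1024, 1024)
groups = gaussian_quantile_groups(weight)
weight_q = binarize_by_group(weight, groups)
print("mean abs reconstruction error:", (weight - weight_q).abs().mean().item())
```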

Published

2026-03-14

How to Cite

Wang, X., Abdalla, R., Huang, J., Zhang, C., Xian, R., & Manocha, D. (2026). Bi-VLM: Binary Post-Training Quantization for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10207-10215. https://doi.org/10.1609/aaai.v40i12.37989

Section

AAAI Technical Track on Computer Vision IX