Bi-VLM: Binary Post-Training Quantization for Vision-Language Models

Authors

  • Xijun Wang, University of Maryland, College Park
  • Rayyan Abdalla, University of Maryland, College Park
  • Junyun Huang, University of Maryland, College Park
  • Chengyuan Zhang, University of Maryland, College Park
  • Ruiqi Xian, University of Maryland, College Park
  • Dinesh Manocha, University of Maryland, College Park

DOI:

https://doi.org/10.1609/aaai.v40i12.37989

Abstract

We address the critical gap between the computational demands of vision-language models (VLMs) and the ultra-low-bit weight precision (bitwidth <= 2 bits) that can be used for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which partitions model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier subset and multiple inlier subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile of the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and the compression objective. We evaluate our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%.
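
To make the quantile-based weight grouping concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): it fits a Gaussian to a layer's weight tensor, assigns each weight to an inlier or outlier group using two-sided Gaussian quantile thresholds, and binarizes each group with its own scaling factor. The specific quantile probabilities, group count, and mean-|w| scaling rule are illustrative assumptions, and the paper's saliency-aware constraints are omitted.

```python
# Illustrative sketch of quantile-based weight grouping + per-group binarization.
# Assumptions: the quantile probabilities, group count, and mean-|w| scaling
# rule are placeholders; this is not the paper's Bi-VLM algorithm.
import torch


def gaussian_quantile_groups(w: torch.Tensor, probs=(0.5, 0.75, 0.9, 0.99)):
    """Assign each weight to a group by |z-score|, with thresholds taken from
    two-sided quantiles of a Gaussian fit to the weight tensor. The last group
    (beyond the 0.99 quantile here) plays the role of the outlier subset."""
    mu, sigma = w.mean(), w.std()
    std_normal = torch.distributions.Normal(0.0, 1.0)
    # Two-sided threshold: P(|Z| <= t) = p  =>  t = icdf((1 + p) / 2).
    cuts = torch.stack([std_normal.icdf(torch.tensor((1.0 + p) / 2.0)) for p in probs])
    z = (w - mu).abs() / sigma
    return torch.bucketize(z, cuts)  # group index 0..len(probs) per weight


def binarize_by_group(w: torch.Tensor, group_id: torch.Tensor) -> torch.Tensor:
    """Binarize each group separately: w_hat = alpha_g * sign(w), where alpha_g
    is the mean |w| within the group (a common binary-quantization choice)."""
    w_hat = torch.zeros_like(w)
    for g in range(int(group_id.max()) + 1):
        mask = group_id == g
        if mask.any():
            alpha = w[mask].abs().mean()  # per-group scaling factor
            w_hat[mask] = alpha * torch.sign(w[mask])
    return w_hat


# Usage: quantize one linear layer's weight matrix and check the error.
weight = torch.randn(1024, 1024)
groups = gaussian_quantile_groups(weight)
weight_q = binarize_by_group(weight, groups)
print("mean abs reconstruction error:", (weight - weight_q).abs().mean().item())
```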

Published

2026-03-14

How to Cite

Wang, X., Abdalla, R., Huang, J., Zhang, C., Xian, R., & Manocha, D. (2026). Bi-VLM: Binary Post-Training Quantization for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10207-10215. https://doi.org/10.1609/aaai.v40i12.37989

Section

AAAI Technical Track on Computer Vision IX