Bi-ViT: Pushing the Limit of Vision Transformer Quantization

Authors

  • Yanjing Li Beihang University
  • Sheng Xu Beihang University
  • Mingbao Lin Tencent
  • Xianbin Cao Beihang University, China
  • Chuanjian Liu Huawei Noah's Ark Lab
  • Xiao Sun Shanghai Artificial Intelligence Laboratory
  • Baochang Zhang Zhongguancun Laboratory; Hangzhou Research Institute, Beihang University; Nanchang Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v38i4.28109

Keywords:

CV: Object Detection & Categorization

Abstract

Quantizing vision transformers (ViTs) is a promising way to deploy large pre-trained networks on resource-limited devices. Fully binarized ViTs (Bi-ViT), which push ViT quantization to its limit, remain largely unexplored and very challenging because of their unacceptable performance. Through extensive empirical analyses, we identify that the severe performance drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet. Our code and models are available at https://github.com/YanjingLi0202/Bi-ViT/.
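
The attention distortion described in the abstract can be illustrated with a small sketch. It is a toy example, not the paper's formulation: the thresholding rule in `binarize_attention`, the pairwise count in `ranking_disorder`, and the fixed value of `alpha` (which Bi-ViT would learn during training) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binarize_attention(probs, alpha=1.0):
    """Map each probability to {0, alpha} (assumed rule: p / alpha >= 0.5 -> alpha).

    alpha stands in for a scaling factor; here it is a fixed constant,
    whereas a learnable factor would be optimized with the network.
    """
    return [alpha * (1.0 if p / alpha >= 0.5 else 0.0) for p in probs]

def ranking_disorder(teacher, student):
    """Count ordered pairs (i, j) that the teacher ranks strictly (i above j)
    but the student fails to rank the same way (hypothetical metric)."""
    n = len(teacher)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if teacher[i] > teacher[j] and not student[i] > student[j]
    )

# One row of real-valued attention: every probability is below 0.5.
probs = softmax([2.0, 1.5, 1.0, 0.5, 0.0, -0.5])

plain = binarize_attention(probs, alpha=1.0)   # all entries collapse to 0
scaled = binarize_attention(probs, alpha=0.4)  # the largest entries survive

print(plain, ranking_disorder(probs, plain))
print(scaled, ranking_disorder(probs, scaled))
```

With `alpha=1.0` the whole row collapses to zero, so the binarized attention carries no ranking information at all; a smaller scale keeps the strongest entries and reduces the number of disordered pairs, though ties among the surviving entries remain.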

Published

2024-03-24

How to Cite

Li, Y., Xu, S., Lin, M., Cao, X., Liu, C., Sun, X., & Zhang, B. (2024). Bi-ViT: Pushing the Limit of Vision Transformer Quantization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3243-3251. https://doi.org/10.1609/aaai.v38i4.28109

Issue

Section

AAAI Technical Track on Computer Vision III