Compressing Transformers: Features Are Low-Rank, but Weights Are Not!
DOI:
https://doi.org/10.1609/aaai.v37i9.26304
Keywords:
ML: Learning on the Edge & Model Compression, CV: Representation Learning for Vision, ML: Deep Neural Network Algorithms, SNLP: Language Models
Abstract
The Transformer and its variants achieve excellent results in various computer vision and natural language processing tasks, but high computational costs and reliance on large training datasets restrict their deployment in resource-constrained settings. Low-rank approximation of model weights has been effective in compressing CNN models, but its application to transformers has been less explored and is less effective. Existing methods require the complete dataset to fine-tune compressed models, which is both time-consuming and data-hungry. This paper reveals that the features (i.e., activations) are low-rank, but the model weights are surprisingly not low-rank. Hence, AAFM is proposed, which adaptively determines the compressed model structure and locally compresses each linear layer's output features rather than the model weights. A second stage, GFM, optimizes the entire compressed network holistically. Both AAFM and GFM use only a few unlabeled training samples; that is, they are few-shot, unsupervised, fast, and effective. For example, with only 2K unlabeled images, 33% of the parameters in DeiT-B are removed with an 18.8% relative throughput increase but only a 0.23% accuracy loss on ImageNet recognition. The proposed methods are also successfully applied to the language modeling task in NLP. Moreover, the few-shot compressed models generalize well to downstream tasks.
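As a rough illustration of the feature-level (rather than weight-level) low-rank idea described in the abstract, the following PyTorch sketch factors a single linear layer using the principal subspace of its output activations on a few unlabeled calibration samples. This is a minimal, assumption-laden sketch, not the paper's actual AAFM procedure (which, among other things, determines each layer's rank adaptively); the function name and interface are illustrative only.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def compress_linear_by_output_features(layer: nn.Linear, X: torch.Tensor, rank: int) -> nn.Sequential:
    """Replace one nn.Linear with two thinner linear layers whose composition
    approximates the original outputs, using the low-rank structure of the
    layer's output features on a small calibration batch X (N, in_features)."""
    Y = layer(X)                                  # (N, out_features) output features
    mu = Y.mean(dim=0)                            # center before taking principal directions
    _, _, Vh = torch.linalg.svd(Y - mu, full_matrices=False)
    P = Vh[:rank].T                               # (out_features, rank) principal basis of the features

    # Y ≈ X W^T P P^T + (b - mu) P P^T + mu, i.e., two stacked linear layers.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=True)
    first.weight.copy_(P.T @ layer.weight)        # (rank, in_features)
    second.weight.copy_(P)                        # (out_features, rank)
    b = layer.bias if layer.bias is not None else torch.zeros(layer.out_features, device=Y.device)
    second.bias.copy_(mu + P @ (P.T @ (b - mu)))
    return nn.Sequential(first, second)
```

Such a factorization saves parameters only when rank × (in_features + out_features) < in_features × out_features, which is why the rank chosen per layer matters; in the paper this choice is made adaptively rather than fixed as in this sketch.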
Published
2023-06-26
How to Cite
Yu, H., & Wu, J. (2023). Compressing Transformers: Features Are Low-Rank, but Weights Are Not!. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 11007-11015. https://doi.org/10.1609/aaai.v37i9.26304
Issue
Section
AAAI Technical Track on Machine Learning IV