Compressing Transformers: Features Are Low-Rank, but Weights Are Not!
DOI:
https://doi.org/10.1609/aaai.v37i9.26304
Keywords:
ML: Learning on the Edge & Model Compression, CV: Representation Learning for Vision, ML: Deep Neural Network Algorithms, SNLP: Language Models
Abstract
The Transformer and its variants achieve excellent results in various computer vision and natural language processing tasks, but high computational costs and reliance on large training datasets restrict their deployment in resource-constrained settings. Low-rank approximation of model weights has been effective in compressing CNN models, but its application to transformers has been less explored and is less effective. Existing methods require the complete dataset to fine-tune compressed models, which is both time-consuming and data-hungry. This paper reveals that the features (i.e., activations) are low-rank, but the model weights are surprisingly not low-rank. Hence, AAFM is proposed, which adaptively determines the compressed model structure and locally compresses each linear layer's output features rather than the model weights. A second stage, GFM, optimizes the entire compressed network holistically. Both AAFM and GFM use only a few unlabeled training samples; that is, they are few-shot, unsupervised, fast, and effective. For example, with only 2K unlabeled images, 33% of the parameters in DeiT-B are removed with an 18.8% relative throughput increase but only a 0.23% accuracy loss on ImageNet recognition. The proposed methods are also successfully applied to the language modeling task in NLP. Moreover, the few-shot compressed models generalize well to downstream tasks.
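As a rough illustration of the feature-level (rather than weight-level) low-rank idea described in the abstract, the following PyTorch sketch factors a single linear layer using the principal subspace of its output activations on a few unlabeled calibration samples. This is a minimal, assumption-laden sketch, not the paper's actual AAFM procedure (which, among other things, determines each layer's rank adaptively); the function name and interface are illustrative only.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def compress_linear_by_output_features(layer: nn.Linear, X: torch.Tensor, rank: int) -> nn.Sequential:
    """Replace one nn.Linear with two thinner linear layers whose composition
    approximates the original outputs, using the low-rank structure of the
    layer's output features on a small calibration batch X (N, in_features)."""
    Y = layer(X)                                  # (N, out_features) output features
    mu = Y.mean(dim=0)                            # center before taking principal directions
    _, _, Vh = torch.linalg.svd(Y - mu, full_matrices=False)
    P = Vh[:rank].T                               # (out_features, rank) principal basis of the features

    # Y ≈ X W^T P P^T + (b - mu) P P^T + mu, i.e., two stacked linear layers.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=True)
    first.weight.copy_(P.T @ layer.weight)        # (rank, in_features)
    second.weight.copy_(P)                        # (out_features, rank)
    b = layer.bias if layer.bias is not None else torch.zeros(layer.out_features, device=Y.device)
    second.bias.copy_(mu + P @ (P.T @ (b - mu)))
    return nn.Sequential(first, second)
```

Such a factorization saves parameters only when rank × (in_features + out_features) < in_features × out_features, which is why the rank chosen per layer matters; in the paper this choice is made adaptively rather than fixed as in this sketch.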
Published
2023-06-26
How to Cite
Yu, H., & Wu, J. (2023). Compressing Transformers: Features Are Low-Rank, but Weights Are Not!. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 11007-11015. https://doi.org/10.1609/aaai.v37i9.26304
Issue
Section
AAAI Technical Track on Machine Learning IV