Deeply Tensor Compressed Transformers for End-to-End Object Detection

Peining Zhen; Ziyang Gao; Tianshu Hou; Yuan Cheng; Hai-Bao Chen

doi:10.1609/aaai.v36i4.20397

Authors

Peining Zhen Shanghai Jiao Tong University
Ziyang Gao Shanghai Jiao Tong University
Tianshu Hou Shanghai Jiao Tong University
Yuan Cheng Shanghai Jiao Tong University
Hai-Bao Chen Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v36i4.20397

Keywords:

Domain(s) Of Application (APP), Computer Vision (CV)

Abstract

DEtection TRansformer (DETR) is a recently proposed method that streamlines the detection pipeline and achieves results competitive with two-stage detectors such as Faster-RCNN. DETR models eliminate complex anchor generation and post-processing procedures, thereby making the detection pipeline more intuitive. However, the numerous redundant parameters in transformers make DETR models computation- and storage-intensive, which seriously hinders their deployment on resource-constrained devices. In this paper, to obtain a compact end-to-end detection framework, we propose to deeply compress the transformers with low-rank tensor decomposition. The basic idea of tensor-based compression is to represent the large-scale weight matrix in one network layer with a chain of low-order matrices. Furthermore, we propose a gated multi-head attention (GMHA) module to mitigate the accuracy drop of the tensor-compressed DETR models. In GMHA, each attention head has an independent gate that determines how much of its attention value is passed on. Redundant attention information can be suppressed by normalizing the gates. Lastly, to obtain fully compressed DETR models, a low-bitwidth quantization technique is introduced to further reduce the model storage size. With the proposed methods, we achieve significant parameter and model size reduction while maintaining high detection performance. We conduct extensive experiments on the COCO dataset to validate the effectiveness of our tensor-compressed (tensorized) DETR models. The experimental results show that we can attain 3.7 times full model compression with 482 times feed forward network (FFN) parameter reduction and only a 0.6-point accuracy drop.
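
To illustrate the idea of representing one large weight matrix with a chain of small factors, the following is a minimal sketch that assumes a tensor-train (TT) style factorization; the abstract itself does not name the exact format. The function names, the mode factorizations (4, 8, 8) and (8, 16, 16), and the ranks (1, 4, 4, 1) are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def tt_param_count(in_modes, out_modes, ranks):
    """Count parameters of a TT-format matrix (hypothetical helper).

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]);
    ranks must start and end with 1.
    """
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )

def tt_to_matrix(cores):
    """Contract a chain of TT cores back into the dense matrix they represent."""
    result = cores[0]  # shape (1, m_1, n_1, r_1)
    for core in cores[1:]:
        # Merge the shared rank index, keeping row and column modes separate.
        result = np.einsum("aMNr,rmns->aMmNns", result, core)
        a, M, m, N, n, s = result.shape
        result = result.reshape(a, M * m, N * n, s)
    return result[0, :, :, 0]

# Assumed example: a 256 x 2048 layer, factorized with illustrative modes and ranks.
in_modes, out_modes, ranks = (4, 8, 8), (8, 16, 16), (1, 4, 4, 1)

rng = np.random.default_rng(0)
cores = [
    rng.standard_normal((ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
    for k in range(len(in_modes))
]

dense_params = int(np.prod(in_modes)) * int(np.prod(out_modes))
compressed_params = tt_param_count(in_modes, out_modes, ranks)
W = tt_to_matrix(cores)

print(W.shape, dense_params, compressed_params)
```

With these assumed settings the chain stores 2688 values instead of the 524288 entries of the dense matrix. The compression ratio depends entirely on the chosen modes and ranks, so this figure should not be compared with the results reported in the abstract.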
