Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Sitong Wu; Tianyi Wu; Haoru Tan; Guodong Guo

doi:10.1609/aaai.v36i3.20176

Authors

Sitong Wu: Institute of Deep Learning, Baidu Research; National Engineering Laboratory for Deep Learning Technology and Application
Tianyi Wu: Institute of Deep Learning, Baidu Research; National Engineering Laboratory for Deep Learning Technology and Application
Haoru Tan: School of Artificial Intelligence, University of Chinese Academy of Sciences
Guodong Guo: Institute of Deep Learning, Baidu Research; National Engineering Laboratory for Deep Learning Technology and Application

DOI:

https://doi.org/10.1609/aaai.v36i3.20176

Keywords:

Computer Vision (CV)

Abstract

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by global self-attention, various methods constrain the range of attention to a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to global self-attention, PS-Attention significantly reduces computation and memory costs. Meanwhile, it captures richer contextual information at a computation complexity similar to that of previous local self-attention mechanisms. Based on PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M parameters, respectively, for 224x224 ImageNet-1K classification, outperforming previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone performs better than the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection and instance segmentation. The code will be released at https://github.com/BR-IDL/PaddleViT.
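
The trade-off described in the abstract, quadratic cost for global self-attention versus a small receptive field for local attention, can be illustrated with a rough count of query-key pairs. The sketch below is not the authors' code and does not model PS-Attention itself, since the abstract does not specify the geometry of a pale-shaped region. The 56x56 token grid and the 7x7 window are assumed values chosen for illustration, and the function names are hypothetical.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token."""
    num_tokens = height * width
    return num_tokens * num_tokens  # quadratic in the number of tokens


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when each token attends only within its own
    non-overlapping window x window block (assumed local-attention scheme)."""
    if height % window or width % window:
        raise ValueError("window must divide both height and width")
    return height * width * window * window  # linear in the number of tokens


# Assumed setting: a 56x56 token grid and a 7x7 local window.
H, W, WIN = 56, 56, 7

global_cost = global_attention_pairs(H, W)
local_cost = window_attention_pairs(H, W, WIN)

# Tokens a single query can reach in one attention layer.
global_receptive_field = H * W
local_receptive_field = WIN * WIN

print(f"global: {global_cost} pairs, receptive field {global_receptive_field}")
print(f"local:  {local_cost} pairs, receptive field {local_receptive_field}")
print(f"cost ratio: {global_cost // local_cost}x")
```

Under these assumptions, local attention is far cheaper, but each query sees only the tokens in its window per layer, which is the limited context that the abstract says PS-Attention is designed to address.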
