FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation
DOI:
https://doi.org/10.1609/aaai.v37i2.25321
Keywords:
CV: Segmentation, CV: Vision for Robotics & Autonomous Driving
Abstract
With the success of Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks. Among those, the semantic segmentation task has also benefited greatly from the advance of ViT variants. However, most transformer studies for semantic segmentation focus only on designing efficient transformer encoders, rarely giving attention to designing the decoder. Several studies attempt to use the transformer decoder as the segmentation decoder with class-wise learnable queries. Instead, we aim to directly use the encoder features as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries, and the lowest-level feature as the key and value. This enhances the high-level features by collecting structural information from the lowest-level feature. Additionally, we use a simple reformation trick of pushing the encoder blocks to take the place of the existing self-attention module of the decoder to improve efficiency. We demonstrate the superiority of our decoder against various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its small computational cost, our model achieves state-of-the-art results in the performance-computation trade-off. Our model FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
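The core formulation described above (high-level encoder features as queries, the lowest-level encoder feature as keys and values) can be sketched as plain scaled dot-product cross-attention. The snippet below is a minimal NumPy illustration, not the authors' implementation: the random projection matrices stand in for learned weights, and the token counts and channel dimensions are made-up examples.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(high_feat, low_feat, d_model=64, seed=0):
    """Hypothetical sketch of the FeedFormer-style decoder attention:
    high-level tokens form the queries; the lowest-level tokens supply
    the keys and values, so each high-level token aggregates structural
    detail from the low-level feature map."""
    rng = np.random.default_rng(seed)
    # Random stand-ins for the learned Q/K/V projection weights.
    Wq = rng.standard_normal((high_feat.shape[-1], d_model)) / np.sqrt(high_feat.shape[-1])
    Wk = rng.standard_normal((low_feat.shape[-1], d_model)) / np.sqrt(low_feat.shape[-1])
    Wv = rng.standard_normal((low_feat.shape[-1], d_model)) / np.sqrt(low_feat.shape[-1])
    Q = high_feat @ Wq                              # (N_high, d_model)
    K = low_feat @ Wk                               # (N_low,  d_model)
    V = low_feat @ Wv                               # (N_low,  d_model)
    attn = softmax(Q @ K.T / np.sqrt(d_model))      # (N_high, N_low)
    return attn @ V                                 # enhanced high-level tokens

# Example shapes: a coarse 4x4 high-level map (C=256) attends over a
# fine 32x32 lowest-level map (C=32), both flattened into token sequences.
high = np.random.default_rng(1).standard_normal((16, 256))
low = np.random.default_rng(2).standard_normal((1024, 32))
out = cross_attention(high, low)
print(out.shape)  # (16, 64)
```

Note that the attention map has one row per high-level token and one column per low-level token, which is why this direction of querying lets the coarse features collect fine structural information rather than the other way around.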
Published
2023-06-26
How to Cite
Shim, J.-hun, Yu, H., Kong, K., & Kang, S.-J. (2023). FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2263-2271. https://doi.org/10.1609/aaai.v37i2.25321
Section
AAAI Technical Track on Computer Vision II