FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

Authors

  • Jae-hun Shim, Sogang University
  • Hyunwoo Yu, Sogang University
  • Kyeongbo Kong, Pukyong National University
  • Suk-Ju Kang, Sogang University

DOI:

https://doi.org/10.1609/aaai.v37i2.25321

Keywords:

CV: Segmentation, CV: Vision for Robotics & Autonomous Driving

Abstract

With the success of the Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks, and semantic segmentation has benefited greatly from these advances. However, most transformer studies for semantic segmentation focus only on designing efficient transformer encoders and rarely give attention to the decoder. Several studies have attempted to use the transformer decoder as the segmentation decoder with class-wise learnable queries. Instead, we aim to use the encoder features directly as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries and the lowest-level feature as the key and value, which enhances the high-level features by collecting structural information from the lowest-level feature. Additionally, we use a simple reformation trick that substitutes encoder blocks for the decoder's existing self-attention module to improve efficiency. We demonstrate the superiority of our decoder over various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its small computational cost, our model achieves state-of-the-art performance in the performance-computation trade-off. FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
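To make the core idea concrete, the snippet below is a minimal, hypothetical sketch of the cross-attention decoding the abstract describes: high-level encoder features act as queries and attend to the lowest-level feature as key and value. Module and variable names (FeatureEnhancingBlock, embed_dim, the toy feature shapes) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of FeedFormer-style cross-attention decoding.
# High-level encoder features = queries; lowest-level feature = key/value.
import torch
import torch.nn as nn


class FeatureEnhancingBlock(nn.Module):  # illustrative name, not from the paper's code
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, high_level: torch.Tensor, lowest_level: torch.Tensor) -> torch.Tensor:
        # high_level:   (B, N_high, C) tokens from a high-level encoder stage (queries)
        # lowest_level: (B, N_low,  C) tokens from the lowest-level stage (key and value)
        attended, _ = self.cross_attn(query=high_level, key=lowest_level, value=lowest_level)
        # Residual connection + norm: the high-level features are "enhanced" with
        # structural information gathered from the lowest-level feature.
        return self.norm(high_level + attended)


# Toy usage: spatial feature maps are flattened to token sequences before attention.
B, C = 2, 256
high = torch.randn(B, 16 * 16, C)    # e.g. a 16x16 high-level feature map
low = torch.randn(B, 128 * 128, C)   # e.g. a 128x128 lowest-level feature map
out = FeatureEnhancingBlock(C)(high, low)
print(out.shape)  # torch.Size([2, 256, 256])
```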

Published

2023-06-26

How to Cite

Shim, J.-H., Yu, H., Kong, K., & Kang, S.-J. (2023). FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2263-2271. https://doi.org/10.1609/aaai.v37i2.25321

Section

AAAI Technical Track on Computer Vision II