FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

Authors

  • Jae-hun Shim, Sogang University
  • Hyunwoo Yu, Sogang University
  • Kyeongbo Kong, Pukyong National University
  • Suk-Ju Kang, Sogang University

DOI:

https://doi.org/10.1609/aaai.v37i2.25321

Keywords:

CV: Segmentation, CV: Vision for Robotics & Autonomous Driving

Abstract

With the success of the Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks, and semantic segmentation has benefited greatly from these advances. However, most transformer studies for semantic segmentation focus only on designing efficient transformer encoders and rarely give attention to the decoder. Several studies have attempted to use the transformer decoder as the segmentation decoder with class-wise learnable queries. Instead, we aim to use the encoder features directly as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries and the lowest-level feature as the key and value, which enhances the high-level features by collecting structural information from the lowest-level feature. Additionally, we use a simple reformation trick that substitutes encoder blocks for the decoder's existing self-attention module to improve efficiency. We demonstrate the superiority of our decoder over various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its small computational cost, our model achieves state-of-the-art performance in the performance-computation trade-off. FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
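To make the core idea concrete, the snippet below is a minimal, hypothetical sketch of the cross-attention decoding the abstract describes: high-level encoder features act as queries and attend to the lowest-level feature as key and value. Module and variable names (FeatureEnhancingBlock, embed_dim, the toy feature shapes) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of FeedFormer-style cross-attention decoding.
# High-level encoder features = queries; lowest-level feature = key/value.
import torch
import torch.nn as nn


class FeatureEnhancingBlock(nn.Module):  # illustrative name, not from the paper's code
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, high_level: torch.Tensor, lowest_level: torch.Tensor) -> torch.Tensor:
        # high_level:   (B, N_high, C) tokens from a high-level encoder stage (queries)
        # lowest_level: (B, N_low,  C) tokens from the lowest-level stage (key and value)
        attended, _ = self.cross_attn(query=high_level, key=lowest_level, value=lowest_level)
        # Residual connection + norm: the high-level features are "enhanced" with
        # structural information gathered from the lowest-level feature.
        return self.norm(high_level + attended)


# Toy usage: spatial feature maps are flattened to token sequences before attention.
B, C = 2, 256
high = torch.randn(B, 16 * 16, C)    # e.g. a 16x16 high-level feature map
low = torch.randn(B, 128 * 128, C)   # e.g. a 128x128 lowest-level feature map
out = FeatureEnhancingBlock(C)(high, low)
print(out.shape)  # torch.Size([2, 256, 256])
```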

Published

2023-06-26

How to Cite

Shim, J.-H., Yu, H., Kong, K., & Kang, S.-J. (2023). FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2263-2271. https://doi.org/10.1609/aaai.v37i2.25321

Section

AAAI Technical Track on Computer Vision II