SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Authors

  • Zhengze Xu, Huazhong University of Science and Technology
  • Dongyue Wu, Huazhong University of Science and Technology
  • Changqian Yu, Meituan
  • Xiangxiang Chu, Meituan
  • Nong Sang, Huazhong University of Science and Technology
  • Changxin Gao, Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v38i6.28457

Keywords:

CV: Segmentation, CV: Scene Analysis & Understanding, ML: Deep Learning Algorithms

Abstract

Recent real-time semantic segmentation methods usually adopt an additional semantic branch to capture rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference. To resolve this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet uses a transformer as the training-only semantic branch because of its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block, CFBlock, and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and model are available at https://github.com/xzz777/SCTNet.
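
The training-only branch described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SingleBranchWithTrainingTeacher`, the `align_weight` parameter, and the mean-squared-error alignment loss are assumptions chosen for illustration, and the two "branches" are toy functions on lists of floats rather than real networks. The sketch only shows the control flow: the transformer branch contributes an alignment term during training and is dropped before deployment.

```python
from typing import Callable, List, Optional

Vector = List[float]
Branch = Callable[[Vector], Vector]


def alignment_loss(student_feat: Vector, teacher_feat: Vector) -> float:
    """Mean squared error between features (an assumed stand-in loss)."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("features must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


class SingleBranchWithTrainingTeacher:
    """Hypothetical wrapper: a CNN branch plus a training-only semantic branch."""

    def __init__(
        self,
        cnn_branch: Branch,
        transformer_branch: Optional[Branch] = None,
        align_weight: float = 1.0,  # assumed hyperparameter, not from the paper
    ) -> None:
        self.cnn_branch = cnn_branch
        self.transformer_branch = transformer_branch
        self.align_weight = align_weight

    def training_loss(self, x: Vector, task_loss: float) -> float:
        """Add the alignment term to a task loss computed elsewhere."""
        if self.transformer_branch is None:
            raise RuntimeError("the transformer branch is needed for training")
        student = self.cnn_branch(x)
        teacher = self.transformer_branch(x)
        return task_loss + self.align_weight * alignment_loss(student, teacher)

    def deploy(self) -> None:
        """Drop the semantic branch; only the CNN remains for inference."""
        self.transformer_branch = None

    def infer(self, x: Vector) -> Vector:
        """Inference touches the CNN branch only."""
        return self.cnn_branch(x)


if __name__ == "__main__":
    model = SingleBranchWithTrainingTeacher(
        cnn_branch=lambda x: [2.0 * v for v in x],
        transformer_branch=lambda x: [2.0 * v + 1.0 for v in x],
        align_weight=0.5,
    )
    print(model.training_loss([1.0, 2.0, 3.0], task_loss=0.5))
    model.deploy()
    print(model.infer([1.0, 2.0]))
```

Because `infer` never calls the transformer branch, removing it in `deploy` leaves the inference path and its cost unchanged, which is the property the abstract attributes to the single-branch design.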

Published

2024-03-24

How to Cite

Xu, Z., Wu, D., Yu, C., Chu, X., Sang, N., & Gao, C. (2024). SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6378-6386. https://doi.org/10.1609/aaai.v38i6.28457

Issue

Vol. 38 No. 6 (2024)

Section

AAAI Technical Track on Computer Vision V