IEEE Access (Jan 2023)

Enhancing Semantically Masked Transformer With Local Attention for Semantic Segmentation

  • Zhengyu Xia
  • Joohee Kim

DOI
https://doi.org/10.1109/ACCESS.2023.3329435
Journal volume & issue
Vol. 11
pp. 122345–122356

Abstract


Transformer-based semantic segmentation has been applied to various visual recognition applications and has achieved outstanding performance in recent years. Since most of these approaches adopt a pretrained backbone and fine-tune it for semantic segmentation, they are inefficient at capturing semantic contextual information during the encoding stage, which leads to sub-optimal segmentation performance. To address this problem, SeMask introduces a semantic attention operation that incorporates the semantic contextual information of an image during the encoding stage and improves segmentation performance. However, the architecture of SeMask is based entirely on the attention mechanisms of Transformers and is limited in its ability to fully exploit local details, which are important for more accurate segmentation. In this paper, we introduce a novel semantic layer into the encoder side of a Transformer-based segmentation model. The proposed semantic layer consists of depthwise convolutions with different kernel sizes that capture multi-scale local details. It is integrated at multiple stages of a hierarchical Transformer backbone to acquire multi-scale semantic contextual information on the encoder side, improving overall segmentation performance, especially for small objects. The proposed method can be integrated with common segmentation models such as Semantic-FPN and Mask Transformers. Experimental results show that it achieves state-of-the-art performance with 58.24% mIoU on the ADE20K dataset and 84.97% mIoU on the Cityscapes dataset.
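
To make the core design concrete, the following is a minimal PyTorch sketch of one such semantic layer: parallel depthwise convolutions with different kernel sizes, fused by a pointwise convolution, plus an auxiliary per-stage semantic classifier in the spirit of SeMask's per-stage supervision. This is an illustration inferred from the abstract, not the authors' released code; the module name, kernel sizes, residual fusion, and classifier head are all assumptions.

    import torch
    import torch.nn as nn

    class MultiScaleSemanticLayer(nn.Module):
        # Hypothetical sketch of the proposed semantic layer: depthwise
        # convolutions with different kernel sizes capture multi-scale
        # local details at one backbone stage. Names are illustrative.
        def __init__(self, channels, num_classes, kernel_sizes=(3, 5, 7)):
            super().__init__()
            # One depthwise convolution per kernel size (groups=channels);
            # odd kernels with padding=k//2 preserve spatial resolution.
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
                for k in kernel_sizes
            ])
            # Pointwise convolution fuses the concatenated multi-scale features.
            self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
            # Assumed auxiliary head producing a per-stage semantic prior.
            self.semantic_head = nn.Conv2d(channels, num_classes, 1)

        def forward(self, x):
            # x: (B, C, H, W) feature map from one hierarchical backbone stage.
            multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
            fused = self.fuse(multi_scale)
            semantic_logits = self.semantic_head(fused)  # per-stage semantic map
            return x + fused, semantic_logits            # residual refinement

    # Usage on a Swin-like stage output (channel count is illustrative):
    feats = torch.randn(2, 96, 64, 64)
    layer = MultiScaleSemanticLayer(channels=96, num_classes=150)
    refined, sem_logits = layer(feats)
    print(refined.shape, sem_logits.shape)
    # torch.Size([2, 96, 64, 64]) torch.Size([2, 150, 64, 64])

Under this reading, the layer refines each stage's features with convolutional local context while the auxiliary logits inject semantic supervision on the encoder side; repeating it across stages yields the multi-scale semantic contextual information described above.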

Keywords