IEEE Access (Jan 2024)

MSGFormer: A DeepLabv3+ Like Semantically Masked and Pixel Contrast Transformer for MouseHole Segmentation

  • Peng Yang,
  • Chunmei Li,
  • Chengwu Fang,
  • Shasha Kong,
  • Yunpeng Jin,
  • Kai Li,
  • Haiyang Li,
  • Xiangjie Huang,
  • Yaosheng Han

DOI
https://doi.org/10.1109/ACCESS.2024.3372146
Journal volume & issue
Vol. 12
pp. 33544–33554

Abstract


In semantic segmentation, the efficient representation of multi-scale context is of paramount importance. Inspired by the remarkable performance of Vision Transformers (ViT) in image classification, researchers have subsequently proposed a number of semantic segmentation ViTs, most of which have achieved impressive results. However, these models often fail to effectively utilize multi-scale context, disregard intra-image semantic context, and neglect the global context of the training data, i.e., the semantic relationships among pixels across different images. In this paper, we introduce Sliding Window Dilated Attention and combine it with Spatial Pyramid Pooling (SPP) to form a novel decoder called Sliding Window Dilated Attention Spatial Pyramid Pooling (SwinASPP). By adjusting the sliding-window dilation rates, this decoder captures multi-scale contextual information at different granularities. Additionally, we propose the Semantic Attention Block, which integrates semantic attention operations into the encoder. Finally, by adopting our proposed supervised pixel-wise contrastive learning algorithm, we shift the training strategy for semantic segmentation from individual images to inter-image relationships. Our experiments demonstrate that these methods yield performance improvements on both the SanJiangYuan MouseHole dataset and Cityscapes.
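
The listing gives no implementation details, but the decoder's core operation can be illustrated. Below is a minimal PyTorch sketch of sliding-window dilated attention, in which each pixel attends to a small k × k neighborhood sampled at a configurable dilation rate; the class name WindowDilatedAttention and all hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of sliding-window dilated attention (not the authors'
# implementation): each pixel attends to its k*k neighborhood sampled with a
# dilation rate d, so larger d yields a wider receptive field at no extra cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowDilatedAttention(nn.Module):
    def __init__(self, dim, window_size=3, dilation=2, num_heads=4):
        super().__init__()
        # window_size should be odd so the dilated window centers on the query.
        self.h, self.k, self.d = num_heads, window_size, dilation
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=1)
        pad = self.d * (self.k - 1) // 2        # keeps the spatial size fixed
        # F.unfold gathers, for every pixel, its k*k dilated neighborhood.
        unfold = lambda t: F.unfold(t, self.k, dilation=self.d, padding=pad)
        k = unfold(k).view(B, self.h, C // self.h, self.k ** 2, H * W)
        v = unfold(v).view(B, self.h, C // self.h, self.k ** 2, H * W)
        q = q.view(B, self.h, C // self.h, 1, H * W)
        attn = (q * k).sum(2, keepdim=True) * self.scale  # (B, h, 1, k*k, HW)
        attn = attn.softmax(dim=3)              # normalize over the neighbors
        out = (attn * v).sum(3).reshape(B, C, H, W)
        return self.proj(out)
```

Running several such blocks in parallel with different dilation rates and concatenating their outputs, ASPP-style, approximates the multi-granularity context aggregation the abstract ascribes to SwinASPP.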

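Similarly, the supervised pixel-wise contrastive learning the abstract mentions is, in its generic form, an InfoNCE-style loss over pixel embeddings in which same-class pixels across all images in a batch act as positives. The sketch below assumes random pixel sampling and cosine similarity; the function name pixel_contrast_loss, the temperature, and the sampling strategy are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a supervised, cross-image pixel contrastive loss
# (SupCon-style); sampling and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def pixel_contrast_loss(embeds, labels, temperature=0.1, samples_per_image=256):
    """embeds: (B, D, H, W) pixel embeddings; labels: (B, H, W) class ids.
    Ignore-index handling is omitted for brevity."""
    B, D, H, W = embeds.shape
    feats, ys = [], []
    for b in range(B):
        # Subsample pixels so the pairwise similarity matrix stays small.
        idx = torch.randperm(H * W, device=embeds.device)[:samples_per_image]
        feats.append(embeds[b].flatten(1)[:, idx].t())   # (S, D)
        ys.append(labels[b].flatten()[idx])              # (S,)
    feats = F.normalize(torch.cat(feats), dim=1)         # (B*S, D)
    ys = torch.cat(ys)
    sim = feats @ feats.t() / temperature                # pairwise similarity
    # Positives: same class, drawn from any image in the batch, excluding self.
    eye = torch.eye(len(ys), device=sim.device)
    pos = (ys[:, None] == ys[None, :]).float() * (1 - eye)
    logits = sim - 1e9 * eye                             # mask self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos.sum(1).clamp(min=1)
    loss = -(pos * log_prob).sum(1) / denom
    return loss[pos.sum(1) > 0].mean()                   # anchors w/ positives
```

Because positives are drawn from every image in the batch, minimizing this loss pulls together embeddings of same-class pixels across images, which is the inter-image training signal the abstract describes.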
Keywords