IEEE Access (Jan 2024)

MSGFormer: A DeepLabv3+ Like Semantically Masked and Pixel Contrast Transformer for MouseHole Segmentation

  • Peng Yang,
  • Chunmei Li,
  • Chengwu Fang,
  • Shasha Kong,
  • Yunpeng Jin,
  • Kai Li,
  • Haiyang Li,
  • Xiangjie Huang,
  • Yaosheng Han

DOI
https://doi.org/10.1109/ACCESS.2024.3372146
Journal volume & issue
Vol. 12
pp. 33544–33554

Abstract


In semantic segmentation, the efficient representation of multi-scale context is of paramount importance. Inspired by the remarkable performance of Vision Transformers (ViT) in image classification, researchers have subsequently proposed a number of semantic segmentation ViTs, most of which have achieved impressive results. However, these models often fail to effectively utilize multi-scale context, disregard intra-image semantic context, and neglect the global context of the training data, i.e., the semantic relationships among pixels across different images. In this paper, we introduce Sliding Window Dilated Attention and combine it with Spatial Pyramid Pooling (SPP) to form a novel decoder called Sliding Window Dilated Attention Spatial Pyramid Pooling (SwinASPP). By adjusting the sliding-window dilation rates, this decoder captures multi-scale contextual information at different granularities. Additionally, we propose the Semantic Attention Block, which integrates semantic attention operations into the encoder. Finally, by adopting our proposed supervised pixel-wise contrastive learning algorithm, we shift the training strategy for semantic segmentation from individual images to inter-image relationships. Our experiments demonstrate that these methods yield performance improvements on both the SanJiangYuan MouseHole dataset and Cityscapes.
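
The listing gives no implementation details, but the decoder's core operation can be illustrated. Below is a minimal PyTorch sketch of sliding-window dilated attention, in which each pixel attends to a small k × k neighborhood sampled at a configurable dilation rate; the class name WindowDilatedAttention and all hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of sliding-window dilated attention (not the authors'
# implementation): each pixel attends to its k*k neighborhood sampled with a
# dilation rate d, so larger d yields a wider receptive field at no extra cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowDilatedAttention(nn.Module):
    def __init__(self, dim, window_size=3, dilation=2, num_heads=4):
        super().__init__()
        # window_size should be odd so the dilated window centers on the query.
        self.h, self.k, self.d = num_heads, window_size, dilation
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=1)
        pad = self.d * (self.k - 1) // 2        # keeps the spatial size fixed
        # F.unfold gathers, for every pixel, its k*k dilated neighborhood.
        unfold = lambda t: F.unfold(t, self.k, dilation=self.d, padding=pad)
        k = unfold(k).view(B, self.h, C // self.h, self.k ** 2, H * W)
        v = unfold(v).view(B, self.h, C // self.h, self.k ** 2, H * W)
        q = q.view(B, self.h, C // self.h, 1, H * W)
        attn = (q * k).sum(2, keepdim=True) * self.scale  # (B, h, 1, k*k, HW)
        attn = attn.softmax(dim=3)              # normalize over the neighbors
        out = (attn * v).sum(3).reshape(B, C, H, W)
        return self.proj(out)
```

Running several such blocks in parallel with different dilation rates and concatenating their outputs, ASPP-style, approximates the multi-granularity context aggregation the abstract ascribes to SwinASPP.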

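Similarly, the supervised pixel-wise contrastive learning the abstract mentions is, in its generic form, an InfoNCE-style loss over pixel embeddings in which same-class pixels across all images in a batch act as positives. The sketch below assumes random pixel sampling and cosine similarity; the function name pixel_contrast_loss, the temperature, and the sampling strategy are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a supervised, cross-image pixel contrastive loss
# (SupCon-style); sampling and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def pixel_contrast_loss(embeds, labels, temperature=0.1, samples_per_image=256):
    """embeds: (B, D, H, W) pixel embeddings; labels: (B, H, W) class ids.
    Ignore-index handling is omitted for brevity."""
    B, D, H, W = embeds.shape
    feats, ys = [], []
    for b in range(B):
        # Subsample pixels so the pairwise similarity matrix stays small.
        idx = torch.randperm(H * W, device=embeds.device)[:samples_per_image]
        feats.append(embeds[b].flatten(1)[:, idx].t())   # (S, D)
        ys.append(labels[b].flatten()[idx])              # (S,)
    feats = F.normalize(torch.cat(feats), dim=1)         # (B*S, D)
    ys = torch.cat(ys)
    sim = feats @ feats.t() / temperature                # pairwise similarity
    # Positives: same class, drawn from any image in the batch, excluding self.
    eye = torch.eye(len(ys), device=sim.device)
    pos = (ys[:, None] == ys[None, :]).float() * (1 - eye)
    logits = sim - 1e9 * eye                             # mask self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos.sum(1).clamp(min=1)
    loss = -(pos * log_prob).sum(1) / denom
    return loss[pos.sum(1) > 0].mean()                   # anchors w/ positives
```

Because positives are drawn from every image in the batch, minimizing this loss pulls together embeddings of same-class pixels across images, which is the inter-image training signal the abstract describes.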
Keywords