Remote Sensing (Dec 2021)

Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images

  • Teerapong Panboonyuen,
  • Kulsawasd Jitkajornwanich,
  • Siam Lawawirojwong,
  • Panu Srestasathiern,
  • Peerapon Vateekul

DOI
https://doi.org/10.3390/rs13245100
Journal volume & issue
Vol. 13, no. 24
p. 5100

Abstract

Transformers have achieved remarkable results in natural language processing (NLP) tasks as well as image processing tasks. Herein, we present a deep-learning (DL) model that improves the semantic segmentation network in two ways. First, it uses a pretrained Swin Transformer (SwinTF), a hierarchical variant of the Vision Transformer (ViT), as the backbone, transferring the pretrained weights to downstream tasks by attaching task-specific layers on top of the pretrained encoder. Second, three decoder designs, U-Net, the pyramid scene parsing (PSP) network, and the feature pyramid network (FPN), are applied to our DL network to perform pixel-level segmentation. The results are compared with other state-of-the-art (SOTA) image labeling methods, such as the global convolutional network (GCN) and ViT. Extensive experiments show that our SwinTF with decoder designs reaches a new state of the art on the Thailand Isan Landsat-8 corpus (89.8% F1 score) and the Thailand North Landsat-8 corpus (63.12% F1 score), and achieves competitive results on the ISPRS Vaihingen benchmark. Moreover, our two best-proposed methods (SwinTF-PSP and SwinTF-FPN) even outperform the plain SwinTF with supervised ImageNet-1K pre-training on the Thailand Landsat-8 and ISPRS Vaihingen corpora.
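The design described in the abstract, a pretrained hierarchical transformer encoder with a segmentation decoder head attached on top, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the encoder below is a hypothetical convolutional stand-in that only reproduces the four-scale feature layout of a Swin Transformer, and the FPN-style decoder, channel widths, and class count are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a hierarchical encoder producing
# multi-scale features (standing in for a pretrained SwinTF backbone) topped
# with an FPN-style decoder head for pixel-level segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DummyHierarchicalEncoder(nn.Module):
    """Hypothetical placeholder for a pretrained Swin Transformer: emits four
    feature maps at strides 4, 8, 16, and 32, mimicking Swin's pyramid."""
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        chans = [3] + list(dims)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3,
                          stride=4 if i == 0 else 2, padding=1),
                nn.GELU(),
            )
            for i in range(4)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [C2, C3, C4, C5]


class FPNDecoder(nn.Module):
    """FPN-style decoder: lateral 1x1 convs plus a top-down pathway, fused at
    the finest scale and projected to per-pixel class logits."""
    def __init__(self, in_dims, fpn_dim=256, num_classes=5):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(d, fpn_dim, 1) for d in in_dims])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(fpn_dim, fpn_dim, 3, padding=1) for _ in in_dims]
        )
        self.classify = nn.Conv2d(fpn_dim, num_classes, 1)

    def forward(self, feats, out_size):
        # Top-down pathway: add upsampled coarse context into finer levels.
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        # Fuse all pyramid levels at the finest resolution.
        fused = sum(
            F.interpolate(s(l), size=laterals[0].shape[-2:], mode="nearest")
            for s, l in zip(self.smooth, laterals)
        )
        logits = self.classify(fused)
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    encoder = DummyHierarchicalEncoder()   # a pretrained SwinTF would go here
    decoder = FPNDecoder(in_dims=(96, 192, 384, 768), num_classes=5)
    img = torch.randn(1, 3, 224, 224)      # e.g., one satellite image patch
    logits = decoder(encoder(img), out_size=img.shape[-2:])
    print(logits.shape)                    # torch.Size([1, 5, 224, 224])
```

Swapping the decoder head (U-Net, PSP, or FPN) while keeping the pretrained encoder fixed in structure is the comparison axis the paper explores; a PSP head would replace the top-down pathway above with pooled multi-scale context on the deepest feature map.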

Keywords