A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction

Xiao Xiao; Wenliang Guo; Rui Chen; Yilong Hui; Jianing Wang; Hongyu Zhao

doi:10.3390/rs14112611

Remote Sensing (May 2022)

Affiliations

Xiao Xiao: School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Wenliang Guo: School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Rui Chen: School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Yilong Hui: School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Jianing Wang: School of Artificial Intelligence, Xidian University, Xi’an 710071, China
Hongyu Zhao: State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System (CEMEE), Luoyang 471003, China

DOI: https://doi.org/10.3390/rs14112611
Journal volume & issue: Vol. 14, no. 11, p. 2611

Abstract

Building extraction is a popular topic in remote sensing image processing. Efficient building extraction algorithms can identify and segment building areas to provide informative data for downstream tasks. Currently, building extraction is mainly achieved by deep convolutional neural networks (CNNs) based on the U-shaped encoder–decoder architecture. However, the local receptive field of the convolution operation makes it difficult for CNNs to fully capture the semantic information of large buildings, especially in high-resolution remote sensing images. Considering the recent success of the Transformer in computer vision tasks, we first propose a shifted-window (swin) Transformer-based encoding booster. The proposed encoding booster includes a swin Transformer pyramid containing patch merging layers for down-sampling, which enables it to extract semantics from multi-level features at different scales. Most importantly, the receptive field is significantly expanded by the global self-attention mechanism of the swin Transformer, allowing the encoding booster to capture large-scale semantic information effectively and overcome the limitations of CNNs. Furthermore, we integrate the encoding booster into a specially designed U-shaped network in a novel manner, yielding the Swin Transformer-based Encoding Booster-U-shaped Network (STEB-UNet), which achieves feature-level fusion of local and large-scale semantics. Notably, compared with other Transformer-based networks, the computational complexity and memory requirements of the STEB-UNet are significantly reduced by the swin design, which makes the network much easier to train. Experimental results show that the STEB-UNet can effectively discriminate and extract buildings of different scales and achieves higher accuracy than state-of-the-art networks on public datasets.
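
The two mechanisms named in the abstract, patch merging for down-sampling and the reduced cost of windowed attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random projection weights, and the example sizes are hypothetical, the layer normalization that usually precedes the projection is omitted, and the cost formulas are the commonly cited estimates for global and window-based multi-head self-attention from the original Swin Transformer work, not figures taken from this article.

```python
import numpy as np


def patch_merging(x, weight):
    """Down-sample a feature map by merging each 2x2 patch neighbourhood.

    x:      feature map of shape (H, W, C), with H and W even
    weight: hypothetical projection matrix of shape (4C, 2C)
    Returns a feature map of shape (H/2, W/2, 2C).
    """
    h, w, _ = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Stack the four pixels of every 2x2 neighbourhood along the channel axis.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )
    # Linear projection 4C -> 2C (normalization omitted in this sketch).
    return merged @ weight


def global_attention_cost(h, w, c):
    """Approximate multiply-adds of global self-attention on h*w tokens."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_attention_cost(h, w, c, m):
    """Approximate multiply-adds when attention is limited to m x m windows."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 4))    # toy 8x8 map with 4 channels
projection = rng.standard_normal((16, 8))    # hypothetical 4C -> 2C weights
merged = patch_merging(features, projection)
print(merged.shape)  # (4, 4, 8)

# Illustrative sizes only: a 56x56 token grid, 96 channels, 7x7 windows.
ratio = global_attention_cost(56, 56, 96) / window_attention_cost(56, 56, 96, 7)
print(f"global / windowed cost ratio: {ratio:.1f}")
```

Each merging step halves the spatial resolution and doubles the channel count, which is what produces a feature pyramid. The windowed cost grows linearly with the number of tokens, whereas the global cost grows quadratically, so the gap widens for high-resolution inputs.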

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
