ViT-UNet: A Vision Transformer Based UNet Model for Coastal Wetland Classification Based on High Spatial Resolution Imagery

Nan Zhou; Mingming Xu; Biaoqun Shen; Ke Hou; Shanwei Liu; Hui Sheng; Yanfen Liu; Jianhua Wan

doi:10.1109/JSTARS.2024.3487250

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

Affiliations

Nan Zhou: College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
Mingming Xu: College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
Biaoqun Shen: Shandong Lubang Geographic Information Engineering Company Ltd., Jinan, China
Ke Hou: Shandong Provincial Institute of Land Surveying and Mapping, Jinan, China
Shanwei Liu: College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
Hui Sheng: College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
Yanfen Liu: Observation and Research Station of Bohai Strait Eco-Corridor, MNR, Qingdao, China
Jianhua Wan: College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China

Journal volume & issue: Vol. 17
pp. 19575 – 19587

Abstract

High-resolution remote sensing imagery plays a crucial role in monitoring coastal wetlands. Coastal wetland landscapes exhibit diverse features, ranging from fragmented patches to expansive areas. Mainstream convolutional neural networks cannot effectively analyze spatial relationships among consecutive image elements, which limits their accuracy in classifying coastal wetlands. To address these issues, we propose a Vision Transformer based UNet (ViT-UNet) model. The model extracts wetland features from high-resolution remote sensing images by sensing and optimizing multiscale features. To establish global dependencies, the Vision Transformer (ViT) replaces the convolutional layers in the UNet encoder. The model also incorporates a convolutional block attention module and a multiple hierarchies attention module to restore attentional features and reduce feature loss. In addition, a second skip connection is added to the single-skip structure of the original UNet model. This connection links both the output of the entire transformer and the internal attention features to the corresponding decoder level, giving the decoder comprehensive global information guidance. Finally, all the extracted feature information is fused using Bilinear Polymerization Pooling (BPP), which helps the network obtain a more comprehensive and detailed feature representation. Experimental results on the Gaofen-1 dataset demonstrate that the proposed ViT-UNet method achieves a Precision score of 93.50%, outperforming the original UNet model by 4.10%. Compared with other state-of-the-art networks, ViT-UNet extracts wetland information in the Yellow River Delta more accurately and in finer detail.
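
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a precision score can be computed from a per-pixel confusion matrix. It is an illustration only: the abstract does not state how precision is averaged across wetland classes, so the macro-average used here is an assumption, and the function names and the toy three-class matrix are hypothetical rather than taken from the paper.

```python
def per_class_precision(confusion):
    """Return precision for each class.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j (assumed layout, not from the paper).
    """
    n = len(confusion)
    precisions = []
    for j in range(n):
        # All pixels the model labelled as class j (column sum).
        predicted_as_j = sum(confusion[i][j] for i in range(n))
        true_positive = confusion[j][j]
        # Guard against classes that were never predicted.
        precisions.append(true_positive / predicted_as_j if predicted_as_j else 0.0)
    return precisions


def macro_precision(confusion):
    """Unweighted mean of the per-class precisions (assumed averaging scheme)."""
    values = per_class_precision(confusion)
    return sum(values) / len(values)


# Hypothetical 3-class example; the counts are made up for illustration.
toy_confusion = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 15, 85],
]

class_scores = per_class_precision(toy_confusion)
overall_score = macro_precision(toy_confusion)
print(class_scores, overall_score)
```

Other averaging schemes, such as weighting each class by its pixel count, would give different values on imbalanced wetland scenes, so the figure in the abstract should be read together with the paper's own metric definition.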

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
