IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

ViT-UNet: A Vision Transformer Based UNet Model for Coastal Wetland Classification Based on High Spatial Resolution Imagery

  • Nan Zhou,
  • Mingming Xu,
  • Biaoqun Shen,
  • Ke Hou,
  • Shanwei Liu,
  • Hui Sheng,
  • Yanfen Liu,
  • Jianhua Wan

DOI
https://doi.org/10.1109/JSTARS.2024.3487250
Journal volume & issue
Vol. 17
pp. 19575 – 19587

Abstract

High-resolution remote sensing imagery plays a crucial role in monitoring coastal wetlands. Coastal wetland landscapes exhibit diverse features, ranging from fragmented patches to expansive areas. Mainstream convolutional neural networks cannot effectively analyze spatial relationships among consecutive image elements, which limits their accuracy in classifying coastal wetlands. To address these issues, we propose a Vision Transformer based UNet (ViT-UNet) model. The model extracts wetland features from high-resolution remote sensing images by sensing and optimizing multiscale features. To establish global dependencies, the Vision Transformer (ViT) replaces the convolutional layers in the UNet encoder. The model also incorporates a convolutional block attention module and a multiple hierarchies attention module to restore attentional features and reduce feature loss. In addition, a skip connection is added to the single-skip structure of the original UNet model. This connection links both the output of the entire transformer and the internal attention features to the corresponding decoder level, providing the decoder with comprehensive global information guidance. Finally, all the extracted feature information is fused using Bilinear Polymerization Pooling (BPP), which helps the network obtain a more comprehensive and detailed feature representation. Experimental results on the Gaofen-1 dataset demonstrate that the proposed ViT-UNet method achieves a precision of 93.50%, outperforming the original UNet model by 4.10%. Compared with other state-of-the-art networks, ViT-UNet extracts wetland information in the Yellow River Delta more accurately and in finer detail.
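
The abstract does not define BPP, so the following is only a minimal sketch of classic bilinear pooling, which may differ from the fusion step the paper actually uses. It fuses two feature maps by averaging the outer product of their channel vectors over all spatial locations, then applies the signed square root and L2 normalization commonly used with bilinear features. The function name `bilinear_pool`, the list-of-lists input layout, and the normalization steps are assumptions made for illustration and are not taken from the paper.

```python
import math


def bilinear_pool(feat_a, feat_b):
    """Classic bilinear pooling of two feature maps (illustrative assumption,
    not the paper's BPP definition).

    feat_a: list of N spatial locations, each a list of C_a channel values.
    feat_b: list of the same N locations, each a list of C_b channel values.
    Returns a flat descriptor of length C_a * C_b.
    """
    if not feat_a or len(feat_a) != len(feat_b):
        raise ValueError("feature maps must be non-empty and cover the same locations")

    n = len(feat_a)
    c_a, c_b = len(feat_a[0]), len(feat_b[0])

    # Average the outer product a * b^T over all spatial locations.
    pooled = [0.0] * (c_a * c_b)
    for a, b in zip(feat_a, feat_b):
        for i in range(c_a):
            for j in range(c_b):
                pooled[i * c_b + j] += a[i] * b[j] / n

    # Signed square root dampens very large pairwise responses.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]

    # L2 normalization; an all-zero descriptor is returned unchanged.
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Two locations, two channels in the first map and one in the second.
descriptor = bilinear_pool([[1.0, 0.0], [0.0, 1.0]], [[2.0], [2.0]])
```

In a real network the two inputs would be tensors from different branches (for example, transformer and attention features), and the pooling would typically run as a batched matrix product rather than Python loops.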

Keywords