IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

  • Min Yao
  • Yaozu Zhang
  • Guofeng Liu
  • Dongdong Pang

DOI
https://doi.org/10.1109/JSTARS.2024.3349657
Journal volume & issue
Vol. 17
pp. 3023 – 3037

Abstract

Remote sensing semantic segmentation remains challenging because of the diversity and complexity of objects. Transformer-based models have significant advantages in capturing global feature dependencies for segmentation, but they tend to neglect local feature details. Convolutional neural networks (CNNs), whose interaction mechanism differs from that of transformers, instead capture small-scale local features rather than global ones. In this article, we propose SSNet, a new semantic segmentation framework with an encoder–decoder structure that combines the advantages of both local and global features. We also build a feature fuse module and a feature inject module to fuse these two styles of features. The former captures the dependencies between different positions and channels to extract multiscale features, which improves segmentation precision on similar objects. The latter condenses the global information in the transformer and injects it into the CNN to obtain a broad global field of view; its depthwise strip convolution improves segmentation accuracy on tiny objects. A CNN-based decoder progressively recovers the feature map size, and an atrous spatial pyramid pooling block is adopted in the decoder to obtain multiscale context. Skip connections between the decoder and the encoder retain important feature information from the shallow layers and help multiscale features flow through the network. To evaluate our model, we compare it with current state-of-the-art models on the WHDLD and Potsdam datasets. The experimental results indicate that the proposed model achieves more precise semantic segmentation.
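
The abstract names a depthwise strip convolution but does not give its details, so the following is only a minimal pure-Python sketch of the general technique, not the authors' implementation. It assumes that each channel is filtered independently (the "depthwise" part) by a horizontal 1 × k kernel followed by a vertical k × 1 kernel (the "strip" part), with zero padding so the output keeps the input size. The function names, the kernel ordering, and the padding choice are all illustrative assumptions.

```python
def strip_conv_1d(values, kernel):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(values):  # positions outside the input count as zero
                acc += w * values[idx]
        out.append(acc)
    return out


def depthwise_strip_conv(feature_map, h_kernels, v_kernels):
    """Apply a 1 x k then a k x 1 kernel to every channel independently.

    feature_map: list of channels, each an H x W nested list.
    h_kernels[c], v_kernels[c]: hypothetical per-channel strip kernels.
    """
    output = []
    for channel, hk, vk in zip(feature_map, h_kernels, v_kernels):
        # Horizontal strip: filter along each row.
        rows = [strip_conv_1d(row, hk) for row in channel]
        # Vertical strip: filter along each column of the row-filtered map.
        cols = [strip_conv_1d(list(col), vk) for col in zip(*rows)]
        # Transpose back to H x W layout.
        output.append([list(row) for row in zip(*cols)])
    return output


# A single-channel impulse spreads into a cross-shaped receptive field
# that the two strips fill out to the full 3 x 3 window.
impulse = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
result = depthwise_strip_conv(impulse, [[1, 1, 1]], [[1, 1, 1]])
```

Under these assumptions, a pair of strips covers a k × k window with 2k weights per channel instead of k², which is one common motivation for strip kernels when long, thin structures or a wide field of view matter.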

Keywords