IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Semantic Segmentation of Remote Sensing Images With Transformer-Based U-Net and Guided Focal-Axial Attention
Abstract
In the field of remote sensing, semantic segmentation of unmanned aerial vehicle (UAV) imagery is crucial for tasks such as land resource management, urban planning, precision agriculture, and economic assessment. Traditional methods use convolutional neural networks (CNNs) for hierarchical feature extraction but are limited by their local receptive fields, which restrict comprehensive contextual understanding. To overcome these limitations, we propose combining transformers with attention mechanisms to improve object classification, leveraging their superior information-modeling capabilities to enhance scene understanding. In this article, we present the Swin-based focal-axial attention network (SwinFAN), a U-Net framework with a Swin transformer as the encoder, equipped with a novel decoder that introduces two new components for enhanced semantic segmentation of urban remote sensing images. The first proposed component is a guided focal-axial (GFA) attention module that combines local and global contextual information, enhancing the model's ability to discern intricate details and complex structures. The second component is an innovative attention-based feature refinement head (AFRH) designed to improve the precision and clarity of segmentation outputs through self-attention and convolutional techniques. Comprehensive experiments demonstrate that our proposed architecture significantly outperforms state-of-the-art models in accuracy. More specifically, our method achieves mean intersection over union (mIoU) improvements of 1.9% on UAVid, 3.6% on Potsdam, 1.9% on Vaihingen, and 0.8% on LoveDA.
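The abstract names the GFA module's ingredients (local "focal" context, global axial attention, and a guiding fusion of the two) without specifying its internals, so the following is only a minimal PyTorch sketch of one plausible reading. The class and branch names (GuidedFocalAxialAttention, focal, axial_h, axial_w, gate) are hypothetical and are not taken from the paper's actual implementation.

```python
# Hypothetical sketch of a guided focal-axial attention block; the abstract
# does not give the GFA design, so every structural choice here is assumed.
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """Multi-head self-attention applied along one spatial axis (H or W)."""

    def __init__(self, dim, heads=4, axis="h"):
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        if self.axis == "h":                   # attend along the height axis
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:                                  # attend along the width axis
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        if self.axis == "h":
            out = out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        else:
            out = out.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out


class GuidedFocalAxialAttention(nn.Module):
    """Assumed GFA reading: a local (focal) conv branch produces a gate that
    guides the fusion of global axial attention with local context."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.focal = nn.Sequential(            # local-context (focal) branch
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
        )
        self.axial_h = AxialAttention(dim, heads, axis="h")
        self.axial_w = AxialAttention(dim, heads, axis="w")
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        local = self.focal(x)
        global_ctx = self.axial_w(self.axial_h(x))
        g = self.gate(local)                   # local features guide the mix
        return x + self.proj(g * global_ctx + (1 - g) * local)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)          # a decoder-stage feature map
    print(GuidedFocalAxialAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```

Under this reading, the axial branches give each pixel a full-image receptive field at linear cost (one row pass plus one column pass), while the gated residual lets local detail dominate wherever global context is uninformative.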
Keywords