IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

  • Yan Chen,
  • Quan Dong,
  • Xiaofeng Wang,
  • Qianchuan Zhang,
  • Menglei Kang,
  • Wenxiang Jiang,
  • Mengyuan Wang,
  • Lixiang Xu,
  • Chen Zhang

DOI
https://doi.org/10.1109/JSTARS.2024.3358851
Journal volume & issue
Vol. 17
pp. 4421 – 4435

Abstract

Amid the rapid progress of deep learning, convolutional neural networks have been widely applied to the semantic segmentation of remote sensing images and have achieved significant progress. However, the local nature of the convolution operation limits their ability to capture global contextual information. Recently, the Transformer has become a focus of research in computer vision, showing great potential for extracting global contextual information and further advancing semantic segmentation. In this article, we use ResNet50 as the encoder, embed a hybrid attention mechanism into the Transformer, and propose a Transformer-based decoder. The Channel-Spatial Transformer Block further aggregates features by integrating the local feature maps extracted by the encoder with their associated global dependencies. At the same time, the interdependent channel maps are adaptively reweighted to enhance feature fusion. The global cross-fusion module combines the extracted complementary features to obtain more comprehensive semantic information. Extensive comparative experiments on the ISPRS Potsdam and Vaihingen datasets yield mIoU scores of 78.06% and 76.37%, respectively, and multiple ablation experiments further validate the effectiveness of the proposed method.
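The adaptive reweighting of interdependent channel maps mentioned in the abstract can be illustrated with a minimal sketch of squeeze-and-excitation-style channel attention. This is an illustrative assumption based on the abstract's description, not the paper's actual Channel-Spatial Transformer Block; the function and weight names (`channel_reweight`, `w1`, `w2`) are hypothetical.

```python
import numpy as np

def channel_reweight(feat, w1, w2):
    """Adaptively reweight the channels of a C x H x W feature map.

    Hypothetical SE-style sketch: global-average-pool each channel
    into a descriptor, pass the descriptors through a small two-layer
    MLP, squash the result to (0, 1) with a sigmoid, and scale each
    channel map by its gate.
    """
    pooled = feat.mean(axis=(1, 2))                  # (C,) channel descriptors
    hidden = np.maximum(pooled @ w1, 0.0)            # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid gates in (0, 1)
    return feat * gates[:, None, None]               # broadcast over H x W

# Toy example: C=8 channels on a 4x4 spatial grid.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((8, 2))   # squeeze: 8 -> 2 bottleneck
w2 = rng.standard_normal((2, 8))   # excite: 2 -> 8 channels
out = channel_reweight(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gates lie strictly in (0, 1), each output channel is a damped copy of its input; in the paper's setting such gates would be learned jointly with the decoder so that informative channels are emphasized during feature fusion.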

Keywords