Remote Sensing (Aug 2024)

AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation

  • Taisei Hanyu,
  • Kashu Yamazaki,
  • Minh Tran,
  • Roy A. McCann,
  • Haitao Liao,
  • Chase Rainwater,
  • Meredith Adkins,
  • Jackson Cothren,
  • Ngan Le

DOI
https://doi.org/10.3390/rs16162930
Journal volume & issue
Vol. 16, no. 16
p. 2930

Abstract

When performing remote sensing image segmentation, practitioners often encounter challenges such as a strong foreground–background imbalance, tiny objects, high object density, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this paper introduces AerialFormer, a hybrid model that strategically combines the strengths of Transformers and Convolutional Neural Networks (CNNs). AerialFormer includes a CNN Stem module that preserves low-level, high-resolution features, enhancing the model’s ability to capture fine details in aerial imagery. AerialFormer has a hierarchical design in which a Transformer encoder generates multi-scale features and a multi-dilated CNN (MDC) decoder aggregates information from these multi-scale inputs. As a result, the model accounts for both local and global context, yielding powerful representations and high-resolution segmentation. AerialFormer was benchmarked on three datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that AerialFormer substantially outperforms state-of-the-art methods.
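
To make the decoder idea concrete, the sketch below illustrates one way a multi-dilated convolution block and a decoder stage of the kind the abstract describes could be written in PyTorch. This is not the authors' implementation: all class names, channel widths, spatial sizes, and dilation rates are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code): a multi-dilated convolution
# block plus one decoder stage that fuses a Transformer-encoder skip feature.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiDilatedConvBlock(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, fused by a 1x1
    convolution, so a single decoder stage sees both local detail and wider context."""

    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))


class DecoderStage(nn.Module):
    """Upsample the coarser feature map, concatenate the encoder skip feature,
    and aggregate the result with a multi-dilated convolution block."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.mdc = MultiDilatedConvBlock(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.mdc(torch.cat([x, skip], dim=1))


if __name__ == "__main__":
    # Toy multi-scale features standing in for the CNN-stem / Transformer-encoder
    # outputs (channel widths and resolutions are hypothetical).
    coarse = torch.randn(1, 768, 8, 8)
    skip = torch.randn(1, 384, 16, 16)
    stage = DecoderStage(in_ch=768, skip_ch=384, out_ch=384)
    print(stage(coarse, skip).shape)  # torch.Size([1, 384, 16, 16])
```

In this sketch, repeating such a stage across the encoder's resolution levels (including the high-resolution stem feature) is what lets the decoder combine local and global context before the final per-pixel classification.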

Keywords