Remote Sensing (Dec 2024)

Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification

  • Dan Zhang,
  • Wenping Ma,
  • Licheng Jiao,
  • Xu Liu,
  • Yuting Yang,
  • Fang Liu

DOI
https://doi.org/10.3390/rs17010042
Journal volume & issue
Vol. 17, no. 1
p. 42

Abstract

The Transformer model can capture global contextual information but lacks an inherent inductive bias. In contrast, convolutional neural networks (CNNs) are valued in computer vision for their strong inductive bias and their ability to model local spatial correlation. To combine the advantages of both model types, we propose a multiple hierarchical cross-scale Transformer model that efficiently integrates the Transformer with CNNs and is designed specifically for complex remote sensing scene classification. First, a feature pyramid network with attention aggregation extracts multi-scale base features. These base features are then fed into the proposed multi-scale channel Transformer (MSCT) module, which derives global features with channel-wise attention. The same base features are also fed into the proposed hierarchical cross-scale Transformer (HCST) module, which produces multi-level cross-scale representations. Finally, the outputs of both modules are combined to calculate the final classification score. The effectiveness of the proposed method is validated on three public datasets: AID, UCM, and NWPU-RESISC45.
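The last step of the pipeline, combining the outputs of the MSCT and HCST modules into one classification score, can be illustrated with a minimal sketch. The abstract does not state how the two outputs are combined, so the weighted sum of per-class logits below, the weight `alpha`, and the function names `fuse_scores` and `predict` are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of per-class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(msct_logits, hcst_logits, alpha=0.5):
    """Combine per-class logits from the two branches.

    ASSUMPTION: a convex combination weighted by `alpha`; the abstract
    only says both outputs are taken into account.
    """
    if len(msct_logits) != len(hcst_logits):
        raise ValueError("both branches must score the same number of classes")
    fused = [alpha * a + (1.0 - alpha) * b
             for a, b in zip(msct_logits, hcst_logits)]
    return softmax(fused)


def predict(msct_logits, hcst_logits, alpha=0.5):
    """Return the index of the class with the highest fused probability."""
    probs = fuse_scores(msct_logits, hcst_logits, alpha)
    return max(range(len(probs)), key=probs.__getitem__)
```

With `alpha=0.5` both branches contribute equally; setting `alpha` to 1.0 or 0.0 falls back to the MSCT or HCST branch alone, which is a convenient way to check each branch's contribution in isolation.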

Keywords