Remote Sensing (Dec 2024)
Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
Abstract
The Transformer model can capture global contextual information but lacks an inherent inductive bias. In contrast, convolutional neural networks (CNNs) are widely valued in computer vision for their strong inductive bias and ability to model local spatial correlations. To combine the advantages of the two model types, we propose a multiple hierarchical cross-scale Transformer that efficiently integrates the Transformer with CNNs and is specifically designed for complex remote sensing scene classification. First, a feature pyramid network with attention aggregation extracts multi-scale base features. These base features are then fed into the proposed multi-scale channel Transformer (MSCT) module to derive global features with channel-wise attention. In parallel, the base features are fed into the proposed hierarchical cross-scale Transformer (HCST) module, which obtains multi-level cross-scale representations. Finally, the outputs of both modules are combined to compute the final classification score. The effectiveness of the proposed method is validated on three public datasets: AID, UCM, and NWPU-RESISC45.
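To make the described pipeline concrete, the following is a minimal PyTorch sketch of the overall data flow (multi-scale backbone features feeding an MSCT-style branch and an HCST-style branch, whose scores are fused). It is not the authors' implementation: the attention-aggregation feature pyramid, the internals of the MSCT and HCST modules, and all hyperparameters are simplified placeholder assumptions chosen only to illustrate the structure.

```python
import torch
import torch.nn as nn


class MultiScaleChannelTransformer(nn.Module):
    """Placeholder MSCT branch: self-attention over one pooled token per scale."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: list of (B, dim, H_i, W_i)
        # Pool each scale to a single token so attention mixes information across scales/channels.
        tokens = torch.stack([f.mean(dim=(2, 3)) for f in feats], dim=1)  # (B, S, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(out + tokens).mean(dim=1)  # (B, dim)


class HierarchicalCrossScaleTransformer(nn.Module):
    """Placeholder HCST branch: coarse-scale queries cross-attend to finer scales, level by level."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # Queries come from the coarsest scale; keys/values from progressively finer scales.
        query = feats[-1].flatten(2).transpose(1, 2)  # (B, N_coarse, dim)
        for f in reversed(feats[:-1]):
            kv = f.flatten(2).transpose(1, 2)
            attended, _ = self.cross(query, kv, kv)
            query = self.norm(query + attended)
        return query.mean(dim=1)  # (B, dim)


class MHCSTClassifier(nn.Module):
    """End-to-end flow: multi-scale backbone -> MSCT and HCST branches -> fused class scores."""

    def __init__(self, dim: int = 64, num_classes: int = 45):
        super().__init__()
        # Stand-in for the attention-aggregation feature pyramid: three strided conv stages.
        self.stages = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else dim, dim, 3, stride=2, padding=1) for i in range(3)]
        )
        self.msct = MultiScaleChannelTransformer(dim)
        self.hcst = HierarchicalCrossScaleTransformer(dim)
        self.head_msct = nn.Linear(dim, num_classes)
        self.head_hcst = nn.Linear(dim, num_classes)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = torch.relu(stage(x))
            feats.append(x)
        # Average the two branch scores to obtain the final classification logits.
        return 0.5 * (self.head_msct(self.msct(feats)) + self.head_hcst(self.hcst(feats)))


if __name__ == "__main__":
    # 45 classes mirrors NWPU-RESISC45; the 64x64 input size is only for a quick shape check.
    model = MHCSTClassifier(num_classes=45)
    logits = model(torch.randn(2, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 45])
```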
Keywords