Remote Sensing (Dec 2024)
Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
Abstract
The Transformer model can capture global contextual information but lacks an inherent inductive bias. In contrast, convolutional neural networks (CNNs) are widely valued in computer vision for their strong inductive bias and ability to model local spatial correlations. To combine the advantages of the two model types, we propose a multiple hierarchical cross-scale Transformer that efficiently integrates the Transformer with CNNs and is specifically designed for complex remote sensing scene classification. First, a feature pyramid network with attention aggregation extracts multi-scale base features. These base features are then fed into the proposed multi-scale channel Transformer (MSCT) module to derive global features with channel-wise attention. In parallel, the base features are fed into the proposed hierarchical cross-scale Transformer (HCST) module, which obtains multi-level cross-scale representations. Finally, the outputs of both modules are combined to compute the final classification score. The effectiveness of the proposed method is validated on three public datasets: AID, UCM, and NWPU-RESISC45.
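To make the described pipeline concrete, the following is a minimal PyTorch sketch of the overall data flow (multi-scale backbone features feeding an MSCT-style branch and an HCST-style branch, whose scores are fused). It is not the authors' implementation: the attention-aggregation feature pyramid, the internals of the MSCT and HCST modules, and all hyperparameters are simplified placeholder assumptions chosen only to illustrate the structure.

```python
import torch
import torch.nn as nn


class MultiScaleChannelTransformer(nn.Module):
    """Placeholder MSCT branch: self-attention over one pooled token per scale."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: list of (B, dim, H_i, W_i)
        # Pool each scale to a single token so attention mixes information across scales/channels.
        tokens = torch.stack([f.mean(dim=(2, 3)) for f in feats], dim=1)  # (B, S, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(out + tokens).mean(dim=1)  # (B, dim)


class HierarchicalCrossScaleTransformer(nn.Module):
    """Placeholder HCST branch: coarse-scale queries cross-attend to finer scales, level by level."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # Queries come from the coarsest scale; keys/values from progressively finer scales.
        query = feats[-1].flatten(2).transpose(1, 2)  # (B, N_coarse, dim)
        for f in reversed(feats[:-1]):
            kv = f.flatten(2).transpose(1, 2)
            attended, _ = self.cross(query, kv, kv)
            query = self.norm(query + attended)
        return query.mean(dim=1)  # (B, dim)


class MHCSTClassifier(nn.Module):
    """End-to-end flow: multi-scale backbone -> MSCT and HCST branches -> fused class scores."""

    def __init__(self, dim: int = 64, num_classes: int = 45):
        super().__init__()
        # Stand-in for the attention-aggregation feature pyramid: three strided conv stages.
        self.stages = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else dim, dim, 3, stride=2, padding=1) for i in range(3)]
        )
        self.msct = MultiScaleChannelTransformer(dim)
        self.hcst = HierarchicalCrossScaleTransformer(dim)
        self.head_msct = nn.Linear(dim, num_classes)
        self.head_hcst = nn.Linear(dim, num_classes)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = torch.relu(stage(x))
            feats.append(x)
        # Average the two branch scores to obtain the final classification logits.
        return 0.5 * (self.head_msct(self.msct(feats)) + self.head_hcst(self.hcst(feats)))


if __name__ == "__main__":
    # 45 classes mirrors NWPU-RESISC45; the 64x64 input size is only for a quick shape check.
    model = MHCSTClassifier(num_classes=45)
    logits = model(torch.randn(2, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 45])
```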
Keywords