A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

Xinyu Wang; Le Sun; Chuhan Lu; Baozhu Li

doi:10.3390/rs16071180

Remote Sensing (Mar 2024)

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

Xinyu Wang,
Le Sun,
Chuhan Lu,
Baozhu Li

Affiliations

Xinyu Wang: School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
Le Sun: School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
Chuhan Lu: School of Atmospheric Sciences, Nanjing University of Information Science and Technology, Nanjing 210044, China
Baozhu Li: Internet of Things & Smart City Innovation Platform, Zhuhai Fudan Innovation Institute, Zhuhai 519031, China

DOI: https://doi.org/10.3390/rs16071180
Journal volume & issue: Vol. 16, no. 7
p. 1180

Abstract

Read online

Recently, with the remarkable advancements of deep learning in the field of image processing, convolutional neural networks (CNNs) have garnered widespread attention from researchers in the domain of hyperspectral image (HSI) classification. Moreover, due to the high performance demonstrated by the transformer architecture in classification tasks, there has been a proliferation of neural networks combining CNNs and transformers for HSI classification. However, the majority of the current methods focus on extracting spatial–spectral features from the HSI data of a single size for a pixel, overlooking the rich multi-scale feature information inherent to the data. To address this problem, we designed a novel transformer network with a CNN-enhanced cross-attention (TNCCA) mechanism for HSI classification. It is a dual-branch network that utilizes different scales of HSI input data to extract shallow spatial–spectral features using a multi-scale 3D and 2D hybrid convolutional neural network. After converting the feature maps into tokens, a series of 2D convolutions and dilated convolutions are employed to generate two sets of Q (queries), K (keys), and V (values) at different scales in a cross-attention module. This transformer with CNN-enhanced cross-attention explores multi-scale CNN-enhanced features and fuses them from both branches. Experimental evaluations conducted on three widely used hyperspectral image (HSI) datasets, under the constraint of limited sample size, demonstrate excellent classification performance of the proposed network.

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/

About the journal

Abstract

Keywords