A U-Shaped Convolution-Aided Transformer with Double Attention for Hyperspectral Image Classification

Ruiru Qin; Chuanzhi Wang; Yongmei Wu; Huafei Du; Mingyun Lv

doi:10.3390/rs16020288

Remote Sensing (Jan 2024)

Affiliations

Ruiru Qin: School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
Chuanzhi Wang: School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
Yongmei Wu: School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
Huafei Du: School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
Mingyun Lv: School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China

DOI: https://doi.org/10.3390/rs16020288
Journal volume & issue: Vol. 16, no. 2, p. 288

Abstract

Convolutional neural networks (CNNs) and transformers have achieved great success in hyperspectral image (HSI) classification. However, CNNs are inefficient at establishing long-range dependencies, and transformers may overlook local information. To overcome these limitations, we propose a U-shaped convolution-aided transformer (UCaT) that incorporates convolutions into a novel transformer architecture to aid classification. Group convolutions are employed as parallel local descriptors to extract detailed features, and multi-head self-attention then recalibrates these features in consistent groups, emphasizing informative features while maintaining the inherent spectral–spatial data structure. Specifically, three components are constructed, each with its own strategy. First, the spectral groupwise self-attention (spectral-GSA) component provides spectral attention: it selectively emphasizes diagnostic spectral features among neighboring bands and reduces the spectral dimension. Then, the spatial dual-scale convolution-aided self-attention (spatial-DCSA) encoder and the spatial convolution-aided cross-attention (spatial-CCA) decoder form a U-shaped architecture for per-pixel classification over HSI patches, where the encoder uses a dual-scale strategy to explore information at different scales and the decoder adopts cross-attention for information fusion. Experimental results on three datasets demonstrate that the proposed UCaT outperforms the competing methods. Additionally, a visual explanation of the UCaT is given, showing its ability to build global interactions and capture pixel-level dependencies.
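The idea of pairing group convolutions with self-attention applied in the same groups can be illustrated with a small sketch. This is not the authors' implementation: the function names (`groupwise_attention`, `self_attention`), the use of a grouped 1x1 convolution as the local descriptor, the identity query/key/value projections, and the choice of patch pixels as tokens are all assumptions made for brevity. It uses only the Python standard library.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), which keeps the sketch short.
    """
    dim = len(tokens[0])
    transposed = [list(col) for col in zip(*tokens)]
    scores = matmul(tokens, transposed)  # token-to-token similarities
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, tokens), weights


def groupwise_attention(patch, kernels):
    """Hypothetical groupwise block.

    patch:   N pixels, each a list of B band values.
    kernels: G weight matrices, each (B / G) x d, one per band group.
    Returns N vectors of length G * d.
    """
    groups = len(kernels)
    bands = len(patch[0])
    if bands % groups != 0:
        raise ValueError("number of bands must be divisible by number of groups")
    width = bands // groups
    out = [[] for _ in patch]
    for g, kernel in enumerate(kernels):
        # Grouped 1x1 convolution: each group sees only its own neighboring bands.
        chunk = [pixel[g * width:(g + 1) * width] for pixel in patch]
        feats = matmul(chunk, kernel)
        # Attention recalibrates the features of this group only.
        attended, _ = self_attention(feats)
        for row, vec in zip(out, attended):
            row.extend(vec)  # concatenate groups, preserving their order
    return out


if __name__ == "__main__":
    random.seed(0)
    # Toy patch: 5 pixels with 8 bands, split into 2 groups of 4 bands.
    patch = [[random.random() for _ in range(8)] for _ in range(5)]
    kernels = [[[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
               for _ in range(2)]
    result = groupwise_attention(patch, kernels)
    print(len(result), len(result[0]))
```

Because each group is convolved and attended separately before concatenation, a change to the bands of one group leaves the output features of the other groups untouched, which is one simple way to read "recalibrates these features in consistent groups".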

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
