An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution

Wenjian Zhang; Zheng Tan; Qunbo Lv; Jiaao Li; Baoyu Zhu; Yangyang Liu

doi:10.3390/rs16050880

Remote Sensing (Mar 2024)

An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution

Wenjian Zhang,
Zheng Tan,
Qunbo Lv,
Jiaao Li,
Baoyu Zhu,
Yangyang Liu

Affiliations

Wenjian Zhang: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
Zheng Tan: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
Qunbo Lv: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
Jiaao Li: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
Baoyu Zhu: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
Yangyang Liu: Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China

DOI: https://doi.org/10.3390/rs16050880
Journal volume & issue: Vol. 16, no. 5
p. 880

Abstract

Read online

Transformer models have great potential in the field of remote sensing super-resolution (SR) due to their excellent self-attention mechanisms. However, transformer models are prone to overfitting because of their large number of parameters, especially with the typically small remote sensing datasets. Additionally, the reliance of transformer-based SR models on convolution-based upsampling often leads to mismatched semantic information. To tackle these challenges, we propose an efficient super-resolution hybrid network (EHNet) based on the encoder composed of our designed lightweight convolution module and the decoder composed of an improved swin transformer. The encoder, featuring our novel Lightweight Feature Extraction Block (LFEB), employs a more efficient convolution method than depthwise separable convolution based on depthwise convolution. Our LFEB also integrates a Cross Stage Partial structure for enhanced feature extraction. In terms of the decoder, based on the swin transformer, we innovatively propose a sequence-based upsample block (SUB) for the first time, which directly uses the sequence of tokens in the transformer to focus on semantic information through the MLP layer, which enhances the feature expression ability of the model and improves the reconstruction accuracy. Experiments show that EHNet’s PSNR on UCMerced and AID datasets obtains a SOTA performance of 28.02 and 29.44, respectively, and is also visually better than other existing methods. Its 2.64 M parameters effectively balance model efficiency and computational demands.

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/

About the journal

Abstract

Keywords