Remote Sensing (Oct 2024)

Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images

  • Wanying Song
  • Fangxin Nie
  • Chi Wang
  • Yinyin Jiang
  • Yan Wu

DOI: https://doi.org/10.3390/rs16203774
Journal volume & issue: Vol. 16, no. 20, p. 3774

Abstract

Generating pixel-level annotations for semantic segmentation of high-resolution remote sensing images is time-consuming and labor-intensive, which has led to growing interest in unsupervised methods. In this paper, we therefore propose an unsupervised multi-scale hybrid feature extraction network based on a CNN-Transformer architecture, referred to as MSHFE-Net. MSHFE-Net consists of three main modules: a Multi-Scale Pixel-Guided CNN Encoder, a Multi-Scale Aggregation Transformer Encoder, and a Parallel Attention Fusion Module. The Multi-Scale Pixel-Guided CNN Encoder performs multi-scale, fine-grained feature extraction for unsupervised tasks, efficiently recovering local spatial information in images. The Multi-Scale Aggregation Transformer Encoder introduces a multi-scale aggregation module that further enhances the unsupervised acquisition of multi-scale contextual information, yielding global features with stronger representational power. The Parallel Attention Fusion Module uses an attention mechanism to fuse global and local features along the channel and spatial dimensions in parallel, enriching the semantic relations extracted during unsupervised training and improving unsupervised segmentation performance. K-means clustering is then applied to the fused features to achieve high-precision unsupervised semantic segmentation. Experiments on the Potsdam and Vaihingen datasets demonstrate that MSHFE-Net significantly improves the accuracy of unsupervised semantic segmentation.
