Enhanced semantic-positional feature fusion network via diverse pre-trained encoders for remote sensing image water-body segmentation

Zhili Zhang; Xiangyun Hu; Bingnan Yang; Kai Deng; Mi Zhang; Dehui Zhu

doi:10.1080/10095020.2024.2416898

Geo-spatial Information Science (Oct 2024)

Affiliations

Zhili Zhang: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Xiangyun Hu: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Bingnan Yang: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Kai Deng: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Mi Zhang: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Dehui Zhu: College of Electrical Science and Technology, National University of Defense Technology, Changsha, China

Abstract

In the era of increasingly advanced Earth Observation (EO) technologies, extracting pertinent information (such as water bodies) from the Earth’s surface has become a crucial task. Deep learning, especially via pre-trained models, currently offers a highly promising approach for the semantic segmentation of Remote Sensing Imagery (RSI). However, effectively adapting these pre-trained models to RSI tasks remains challenging. Typically, these models are fine-tuned for specialized tasks, which modifies the parameters or structure of the original architecture and may impair their inherent generalization capabilities. Furthermore, robust models pre-trained on natural images are not specifically designed for RSI, which makes their direct application to RSI tasks difficult. To alleviate these problems, our study introduces a lightweight Enhanced Semantic-positional Feature Fusion Network (ESFFNet), leveraging diverse pre-trained image encoders alongside extensive EO data. The proposed method first uses pre-trained encoders, specifically Vision Transformer (ViT)-based and Convolutional Neural Network (CNN)-based models, to extract deep semantic features and precise positional features, respectively, without additional training. Following this, we introduce the Enhanced Semantic-positional Feature Fusion Module (ESFFM), which merges the semantic features derived from the ViT-based encoder with the spatial features extracted from the CNN-based encoder. This integration is realized via multi-scale feature fusion, local and long-distance feature integration, and dense connectivity strategies, leading to a robust feature representation. Finally, the Primary Segmentation-guided Fine Extraction Module (PSFEM) further improves the precision of remote sensing image segmentation. Together, these two modules constitute our lightweight decoder, which has fewer than 4 M parameters. Our approach is evaluated on two distinct water-body datasets, where it outperforms other leading segmentation techniques. In addition, our method performs well on other remote sensing segmentation tasks, such as building extraction and land cover classification. The source code will be available at https://github.com/zhilyzhang/ESFFNet.
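The abstract does not specify how the ESFFM combines the two feature streams internally, so the following is only a minimal sketch of the general idea of fusing a coarse semantic feature map with a finer positional one: upsample the coarse map, concatenate along channels, and mix channels with a 1x1 projection. It is not the authors' implementation; the names `upsample_nearest`, `fuse`, and `weights`, the nested-list feature layout, and the use of nearest-neighbour upsampling are all assumptions made for illustration.

```python
from typing import List

# Hypothetical layout for illustration: a feature map is [channels][height][width].
FeatureMap = List[List[List[float]]]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling of every channel by an integer factor."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat each source row `factor` times
        ]
        for channel in fmap
    ]


def fuse(semantic: FeatureMap, positional: FeatureMap,
         weights: List[List[float]]) -> FeatureMap:
    """Fuse a coarse semantic map with a finer positional map.

    Assumes the positional map is an integer multiple of the semantic map
    in height and width. `weights[o][c]` is the (hypothetical) 1x1 projection
    weight from stacked input channel c to output channel o.
    """
    factor = len(positional[0]) // len(semantic[0])
    # Channel-wise concatenation: semantic channels first, then positional.
    stacked = upsample_nearest(semantic, factor) + positional
    height, width = len(stacked[0]), len(stacked[0][0])
    return [
        [
            [sum(w * stacked[c][y][x] for c, w in enumerate(out_w))
             for x in range(width)]
            for y in range(height)
        ]
        for out_w in weights
    ]


# Toy inputs: one 1x1 semantic channel and one 2x2 positional channel.
semantic = [[[2.0]]]
positional = [[[1.0, 2.0], [3.0, 4.0]]]
fused = fuse(semantic, positional, weights=[[1.0, 1.0]])
```

In a real network the projection weights would be learned by the decoder while both encoders stay frozen, which is how the trainable part can remain small.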

Published in Geo-spatial Information Science

ISSN: 1009-5020 (Print); 1993-5153 (Online)
Publisher: Taylor & Francis Group
Country of publisher: United Kingdom
LCC subjects: Geography. Anthropology. Recreation: Mathematical geography. Cartography; Science: Astronomy: Geodesy
Website: https://www.tandfonline.com/journals/tgsi
