Geo-spatial Information Science (Oct 2024)
Enhanced semantic-positional feature fusion network via diverse pre-trained encoders for remote sensing image water-body segmentation
Abstract
In the era of increasingly advanced Earth Observation (EO) technologies, extracting pertinent information (such as water bodies) from the Earth’s surface has become a crucial task. Deep Learning, especially via pre-trained models, currently offers a highly promising approach for the semantic segmentation of Remote Sensing Imagery (RSI). However, effectively adapting these pre-trained models to RSI tasks remains challenging. Typically, these models are fine-tuned for specialized tasks, which modifies their parameters or architecture and may impair their inherent generalization capabilities. Furthermore, robust models pre-trained on natural images are not specifically designed for RSI, which makes them difficult to apply directly to RSI tasks. To alleviate these problems, our study introduces a lightweight Enhanced Semantic-positional Feature Fusion Network (ESFFNet), leveraging diverse pre-trained image encoders alongside extensive EO data. The proposed method first uses pre-trained encoders, specifically a Vision Transformer (ViT)-based model and a Convolutional Neural Network (CNN)-based model, to extract deep semantic features and precise positional features, respectively, without additional training. Following this, we introduce the Enhanced Semantic-positional Feature Fusion Module (ESFFM), which merges the semantic features from the ViT-based encoder with the positional (spatial) features from the CNN-based encoder. This integration is realized via multi-scale feature fusion, local and long-distance feature integration, and dense connectivity strategies, leading to a robust feature representation. Finally, the Primary Segmentation-guided Fine Extraction Module (PSFEM) further improves the precision of remote sensing image segmentation. Together, these two modules constitute our lightweight decoder, which has fewer than 4 M parameters.
Our approach is evaluated on two distinct water-body datasets, where the results indicate superiority over other leading segmentation techniques. Our method also performs strongly on other remote sensing segmentation tasks, such as building extraction and land cover classification. The source code will be available at https://github.com/zhilyzhang/ESFFNet.
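To make the fusion idea in the abstract concrete, the following is a minimal NumPy sketch. It is not the authors' ESFFM. It only assumes that a coarse semantic feature map (standing in for a frozen ViT-based encoder output) is upsampled to the resolution of a finer positional feature map (standing in for a frozen CNN-based encoder output), that the two are concatenated along the channel axis, and that a 1x1 projection mixes the channels. The function names, tensor shapes, and random weights are all hypothetical.

```python
import numpy as np


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_features(semantic, positional, weight):
    """Fuse a coarse semantic map with a finer positional map.

    semantic:   (C_s, H / f, W / f), stand-in for ViT-based features
    positional: (C_p, H, W), stand-in for CNN-based features
    weight:     (C_out, C_s + C_p), a hypothetical 1x1 projection
    """
    # Assumes the same integer scale factor along height and width.
    factor = positional.shape[1] // semantic.shape[1]
    upsampled = upsample_nearest(semantic, factor)
    # Stack both feature sets along the channel axis.
    stacked = np.concatenate([upsampled, positional], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum("oc,chw->ohw", weight, stacked)


rng = np.random.default_rng(0)
semantic = rng.standard_normal((8, 4, 4))      # coarse, more channels
positional = rng.standard_normal((4, 16, 16))  # fine, fewer channels
weight = rng.standard_normal((6, 12))          # 12 = 8 + 4 input channels

fused = fuse_features(semantic, positional, weight)
print(fused.shape)  # (6, 16, 16)
```

In this sketch only `weight` would be trainable, which mirrors the abstract's design choice of keeping the pre-trained encoders untouched and confining learning to a small decoder.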
Keywords