IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images
Abstract
Intelligent interpretation of remote sensing images using deep learning relies heavily on large datasets, and models trained in one domain often struggle with cross-domain application. Pretraining the backbone network via masked image modeling can effectively reduce this reliance on extensive sample data, thereby lowering the barriers to cross-domain transfer. However, current masked image models typically employ a pure Transformer architecture, which may not fully capitalize on low-level features. To address these issues, this article proposes masked feature modeling (MFM), a methodology for the generative self-supervised learning of high-resolution remote sensing images that combines convolutional neural network (CNN) and Transformer architectures. This methodology has several advantages: 1) the hybrid CNN + Transformer architecture retains the local feature representation of the CNN architecture while also gaining the global context modeling capabilities of the Transformer architecture; 2) the feature extraction network outputs multiscale features, which makes it easier to add upsampling and skip connections to improve the accuracy of downstream dense prediction tasks; and 3) the pretrained MFM can be applied to various downstream tasks through fine-tuning with limited samples. The publicly available WHU and Massachusetts Building Datasets are used to verify the effectiveness of the proposed method. Extensive experiments on the main properties of the MFM for generative self-supervised learning, on fine-tuning the MFM for the downstream semantic segmentation task, and on comparisons with other state-of-the-art generative self-supervised learning algorithms show that, by combining the advantages of the CNN and Transformer architectures, the proposed method has better feature extraction capability and achieves higher accuracy on downstream tasks such as semantic segmentation.
Keywords