IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation
Abstract
Semisupervised semantic segmentation aims to effectively leverage both unlabeled and scarce labeled images, reducing the reliance on labor-intensive pixel-level annotation for large-scale training. The leading semisupervised learning approach, consistency regularization, employs weak and strong data augmentations to diversify input representations; the model is then compelled to maintain consistent predictions across the different input views, boosting its generalization. However, previous methods suffer from a limited input representation space induced by linear transformations such as CutMix. To address this issue, a consistency regularization method based on masked image modeling (MIM), called MIMSeg, is proposed to achieve accurate segmentation with limited labeled images. First, MIM pixel-wise perception with a ViT encoder-decoder lays the foundation for expanding the data representation space. Second, collaborating with weak data augmentations, two MIM-related strong data augmentations are developed to generate more challenging input views for consistent prediction. Specifically, weak data augmentations replicate input views from various perspectives, while a controllable generative strong data augmentation called masked image reconstruction (MIR) is crafted to simulate diverse imaging conditions while keeping the original semantic information intact. In addition, a more severe strong data augmentation, masked context perturbation (MCP), is designed to generate even more challenging input views and to alleviate semantic deficiency via masked category prediction. Leveraging the MIM perception and the two MIM-related strong data augmentations, the model is compelled to produce consistent predictions across the diverse input views from weak data augmentations, MIR, and MCP. These components yield more stable pixel-level pseudo-labels and facilitate collaborative training on unlabeled and labeled images. Extensive experiments show that MIMSeg achieves state-of-the-art performance in pixel-level prediction with very limited sample annotations.
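To make the weak-to-strong consistency idea concrete, the following PyTorch-style sketch illustrates one unsupervised training step in which pixel-level pseudo-labels from a weakly augmented view supervise a MIM-style masked strong view. This is a minimal illustration under stated assumptions, not the authors' released implementation: the helper names (random_patch_mask, consistency_step), the student/teacher modules, the Gaussian-noise weak augmentation, and the confidence threshold are all hypothetical placeholders for the paper's actual weak augmentations, MIR/MCP strong augmentations, and ViT encoder-decoder.

```python
# Minimal sketch of MIM-based consistency regularization (illustrative only;
# all names and hyperparameters here are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def random_patch_mask(x, patch=16, mask_ratio=0.5):
    """Zero out a random subset of non-overlapping patches (MIM-style masking)."""
    b, c, h, w = x.shape
    ph, pw = h // patch, w // patch
    keep = torch.rand(b, 1, ph, pw, device=x.device) > mask_ratio
    mask = keep.float().repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask

def consistency_step(student, teacher, x_unlabeled, conf_thresh=0.95):
    """One unsupervised step: pseudo-labels from a weak view supervise a masked strong view."""
    with torch.no_grad():
        # Weak augmentation (illustrative stand-in: light Gaussian noise).
        weak = x_unlabeled + 0.01 * torch.randn_like(x_unlabeled)
        probs = torch.softmax(teacher(weak), dim=1)        # (B, K, H, W) class probabilities
        conf, pseudo = probs.max(dim=1)                    # pixel-wise confidence and pseudo-labels
    # Strong view via patch masking, standing in for the MIR/MCP augmentations.
    strong, _ = random_patch_mask(x_unlabeled)
    logits = student(strong)                               # (B, K, H, W) predictions on strong view
    loss = F.cross_entropy(logits, pseudo, reduction="none")
    # Keep only high-confidence pixels, as is common in pseudo-labeling schemes.
    return (loss * (conf > conf_thresh).float()).mean()
```

In this sketch the masking operation plays the role the abstract assigns to the MIM-related strong augmentations: because whole patches are removed, the student must rely on surrounding context to predict per-pixel classes consistently with the teacher's weak-view output, which is the core of the consistency objective described above.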
Keywords