IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2025)
SCDVit: Semantic Change Detection Based on Sam-Vit and Semantic Consistency
Abstract
Change detection has become an active research topic in remote sensing. Most earlier work addressed binary change detection (BCD), which only indicates whether a change occurred and therefore has limited practical use. Semantic change detection (SCD), which also identifies the class of each change, is consequently becoming the more mainstream task. Most existing SCD methods use a convolutional neural network backbone to extract multiscale features and pair it with a relatively simple decoder, which leads to unsatisfactory detection accuracy. We propose SCDVit, a multitask network for SCD. In the encoder, motivated by the success of the Segment Anything Model (SAM) and the vision transformer (ViT) in general-purpose segmentation, we adopt SAM-ViT as the backbone to strengthen the encoder's ability to capture long-range contextual semantic relationships. For the semantic segmentation branch, we propose a transformer-based decoder that extracts both local and global features effectively. For the BCD branch, we propose a convolutional attention-based change extractor that enhances the fusion of temporal information. We also analyze in detail the semantic inconsistency that degrades SCD performance and address it in three ways. First, we introduce a contrastive loss to establish the correlation between the output features of the BCD branch and the segmentation branch. Second, we design a bitemporal graph semantic interaction module to maintain semantic consistency between the output features of the two segmentation branches; the module uses clustering to assign pixels of different land cover types to corresponding graph nodes and then uses cross-attention to model the correlation between the bitemporal semantic features in the graph space. Third, a self-learning training scheme based on pseudolabels further mitigates semantic inconsistency. SCDVit achieves state-of-the-art performance on two popular high-resolution datasets. Quantitative and qualitative analyses further highlight the potential of SAM-ViT for change detection and the effectiveness of the module designed around semantic consistency.
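As an illustration only, the idea behind the bitemporal graph semantic interaction module (assign pixels to graph nodes, exchange information between the two dates with cross-attention, then project back to pixels) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the set of cluster centers shared by both dates, the softmax soft assignment standing in for the unspecified clustering technique, the residual connections, and the absence of learned projection weights are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixels_to_nodes(feats, centers):
    """Soft-assign N pixel features (N, C) to K cluster centers (K, C).

    Returns node features (K, C) and the assignment matrix (N, K).
    Assumption: a softmax over dot-product similarity replaces the
    clustering step, whose details the abstract does not give.
    """
    assign = softmax(feats @ centers.T, axis=1)           # (N, K), rows sum to 1
    weights = assign.sum(axis=0)[:, None] + 1e-8          # (K, 1) total mass per node
    nodes = (assign.T @ feats) / weights                  # weighted mean per node
    return nodes, assign


def cross_attention(query_nodes, context_nodes):
    """Scaled dot-product cross-attention between two node sets.

    Assumption: no learned query/key/value projections, single head.
    """
    scale = np.sqrt(query_nodes.shape[1])
    attn = softmax(query_nodes @ context_nodes.T / scale, axis=1)  # (K, K)
    return query_nodes + attn @ context_nodes             # residual update


def bitemporal_graph_interaction(feats_t1, feats_t2, centers):
    """Exchange semantic context between two dates in graph space."""
    nodes1, assign1 = pixels_to_nodes(feats_t1, centers)
    nodes2, assign2 = pixels_to_nodes(feats_t2, centers)
    nodes1_new = cross_attention(nodes1, nodes2)          # date 1 attends to date 2
    nodes2_new = cross_attention(nodes2, nodes1)          # date 2 attends to date 1
    # Project the updated nodes back to pixel space and add residually.
    out1 = feats_t1 + assign1 @ nodes1_new
    out2 = feats_t2 + assign2 @ nodes2_new
    return out1, out2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f1 = rng.normal(size=(16, 8))       # 16 pixels, 8 channels, date 1
    f2 = rng.normal(size=(16, 8))       # same pixels, date 2
    centers = rng.normal(size=(4, 8))   # 4 hypothetical land cover nodes
    o1, o2 = bitemporal_graph_interaction(f1, f2, centers)
    print(o1.shape, o2.shape)
```

In this sketch, working on a small number of graph nodes rather than on all pixel pairs keeps the cross-attention cost at K × K instead of N × N, which is one plausible motivation for modeling bitemporal correlation in graph space.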
Keywords