International Journal of Applied Earth Observations and Geoinformation (May 2025)

CTSeg: CNN and ViT collaborated segmentation framework for efficient land-use/land-cover mapping with high-resolution remote sensing images

  • Jifa Chen,
  • Gang Chen,
  • Pin Zhou,
  • Yufeng He,
  • Lianzhe Yue,
  • Mingjun Ding,
  • Hui Lin

DOI: https://doi.org/10.1016/j.jag.2025.104546
Journal volume & issue: Vol. 139, p. 104546

Abstract

Semantic segmentation models play a central role in land-use/land-cover (LULC) mapping. Although vision transformers (ViT) with long-sequence interactions have recently emerged as popular alternatives to convolutional neural networks (CNN), they remain less effective on high-resolution remote sensing data, which are typically limited in volume and rich in heterogeneity. In this paper, we propose a novel CNN and ViT collaborated segmentation framework (CTSeg) to address these weaknesses. Following an encoder-decoder architecture, we first introduce an encoding backbone with diverse attention mechanisms that capture global and local contexts. It is designed with parallel dual branches: position-relation aggregation (PRA) blocks and channel-relation aggregation (CRA) blocks form the CNN-based encoding module, whereas the ViT-based module comprises multi-stage window-shifted transformer (WST) blocks with cross-window interactions. We further explore online knowledge distillation, implemented with pixel-wise and channel-wise feature distillation modules, to enable bidirectional learning between the CNN and ViT backbones, supported by a well-designed loss decay strategy. In addition, we develop a multiscale feature decoding module that produces higher-quality segmentation predictions, in which correlation-weighted fusions emphasize the heterogeneous feature representations. Extensive comparison and ablation studies on two benchmark datasets demonstrate its competitive performance for efficient LULC mapping.
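To illustrate the kind of bidirectional online distillation the abstract describes between the CNN and ViT branches, the following is a minimal, hypothetical PyTorch sketch of a pixel-wise feature distillation loss. The function name, tensor shapes, temperature parameter, and KL-divergence formulation are assumptions for illustration only, not the authors' published implementation.

```python
# Hypothetical sketch of bidirectional pixel-wise feature distillation
# between a CNN branch and a ViT branch, loosely following the abstract.
# Names, shapes, and the loss form are illustrative assumptions.
import torch
import torch.nn.functional as F


def pixelwise_distillation(cnn_feat: torch.Tensor,
                           vit_feat: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Symmetric KL divergence between per-pixel channel distributions of
    two feature maps of shape (B, C, H, W); applied in both directions so
    each branch can learn from the other (online/mutual distillation)."""
    b, c, h, w = cnn_feat.shape
    # Flatten spatial dims: (B, C, H*W); softmax over channels per pixel.
    cnn_logits = cnn_feat.reshape(b, c, -1) / temperature
    vit_logits = vit_feat.reshape(b, c, -1) / temperature
    # CNN branch learns from the (detached) ViT branch ...
    loss_c_from_v = F.kl_div(F.log_softmax(cnn_logits, dim=1),
                             F.softmax(vit_logits.detach(), dim=1),
                             reduction="batchmean")
    # ... and the ViT branch learns from the (detached) CNN branch.
    loss_v_from_c = F.kl_div(F.log_softmax(vit_logits, dim=1),
                             F.softmax(cnn_logits.detach(), dim=1),
                             reduction="batchmean")
    return loss_c_from_v + loss_v_from_c


# Example usage with dummy feature maps (batch 2, 64 channels, 32x32).
if __name__ == "__main__":
    cnn_f = torch.randn(2, 64, 32, 32)
    vit_f = torch.randn(2, 64, 32, 32)
    print(pixelwise_distillation(cnn_f, vit_f, temperature=2.0))
```

In the paper's framework this distillation term would be combined with the segmentation loss and weighted by a decaying factor over training (the "loss decay strategy"); the exact schedule is not specified in the abstract.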

Keywords