Remote Sensing (Feb 2021)

C3Net: Cross-Modal Feature Recalibrated, Cross-Scale Semantic Aggregated and Compact Network for Semantic Segmentation of Multi-Modal High-Resolution Aerial Images

  • Zhiying Cao,
  • Wenhui Diao,
  • Xian Sun,
  • Xiaode Lyu,
  • Menglong Yan,
  • Kun Fu

DOI
https://doi.org/10.3390/rs13030528
Journal volume & issue
Vol. 13, no. 3
p. 528

Abstract


Semantic segmentation of multi-modal remote sensing images is an important branch of remote sensing image interpretation. Multi-modal data has been proven to provide rich complementary information for dealing with complex scenes. In recent years, deep learning-based semantic segmentation has achieved remarkable results. It is common to simply concatenate multi-modal data or to use parallel branches to extract multi-modal features separately. However, most existing works ignore the effects of noise and redundant features from different modalities, which may lead to unsatisfactory results. On the one hand, existing networks neither learn the complementary information of different modalities nor suppress the mutual interference between them, which may reduce segmentation accuracy. On the other hand, the introduction of multi-modal data greatly increases the running time of pixel-level dense prediction. In this work, we propose an efficient C3Net that strikes a balance between speed and accuracy. More specifically, C3Net contains several backbones for extracting the features of different modalities. A plug-and-play module is then designed to effectively recalibrate and aggregate the multi-modal features. In order to reduce the number of model parameters while maintaining model performance, we redesign the semantic context extraction module based on lightweight convolutional groups. In addition, a multi-level knowledge distillation strategy is proposed to improve the performance of the compact model. Experiments on the ISPRS Vaihingen dataset demonstrate the superior performance of C3Net, with 15× fewer FLOPs than the state-of-the-art baseline network while providing comparable overall accuracy.
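To make the cross-modal recalibration idea concrete, the following is a minimal, hypothetical sketch of a plug-and-play fusion block in the style described by the abstract, assuming a channel-attention gating between two modality streams (e.g., optical image features and DSM features). Module and variable names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a cross-modal feature recalibration block (not the
# authors' code): each modality's features are reweighted using channel
# statistics of the other modality, then the two streams are aggregated.
import torch
import torch.nn as nn


class CrossModalRecalibration(nn.Module):
    """Cross-gate two modality feature maps, then fuse them by summation."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One small bottleneck MLP per modality produces channel-wise gates.
        self.gate_a = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.gate_b = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat_a.shape
        # Global channel descriptors of each modality.
        desc_a = self.pool(feat_a).view(b, c)
        desc_b = self.pool(feat_b).view(b, c)
        # Cross-gating: modality A is reweighted by statistics of modality B
        # and vice versa, suppressing noisy or redundant channels.
        recal_a = feat_a * self.gate_b(desc_b).view(b, c, 1, 1)
        recal_b = feat_b * self.gate_a(desc_a).view(b, c, 1, 1)
        return recal_a + recal_b


if __name__ == "__main__":
    # Toy usage: fuse two 256-channel feature maps from parallel backbones.
    fuse = CrossModalRecalibration(channels=256)
    image_feat = torch.randn(2, 256, 32, 32)
    dsm_feat = torch.randn(2, 256, 32, 32)
    print(fuse(image_feat, dsm_feat).shape)  # torch.Size([2, 256, 32, 32])
```

This sketch only illustrates the general recalibrate-and-aggregate pattern; the paper's actual module design, backbone choices, and distillation procedure are described in the full text.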

Keywords