IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2025)
LMF-Net: A Learnable Multimodal Fusion Network for Semantic Segmentation of Remote Sensing Data
Abstract
Semantic segmentation of remote sensing images plays a significant role in many applications, such as land cover mapping, land use analysis, and smoke detection. With ever-growing volumes of remote sensing data, fusing multimodal data from different sensors is a feasible and effective scheme for the semantic segmentation task. Deep learning has markedly advanced semantic segmentation. However, most current approaches focus on feature mixing and construct relatively complex architectures, while the mining of cross-modal features in heterogeneous data fusion remains comparatively insufficient. In addition, complex structures lead to a relatively heavy computational burden. Therefore, in this article, we propose an end-to-end learnable multimodal fusion network (LMF-Net) for remote sensing semantic segmentation. Concretely, we first develop a multiscale pooling fusion module that leverages the pooling operator: it provides key-value pairs carrying multimodal complementary information in a parameter-free manner and assigns them to the self-attention (SA) layers of the different modal branches. Then, to further harness cross-modal collaborative embeddings and features, we design two learnable fusion modules, learnable embedding fusion and learnable feature fusion, which dynamically adjust the collaborative relationships among modal embeddings and features, respectively, in a learnable manner. Experiments on two well-established benchmark datasets show that LMF-Net achieves superior segmentation performance and strong generalization capability, while remaining competitive in computational complexity. Finally, the contribution of each component of LMF-Net is evaluated and discussed in detail.
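To make the fusion idea in the abstract concrete, the following is a minimal sketch (not the authors' code) of a parameter-free multiscale pooling fusion feeding shared key-value pairs into the self-attention layer of a modal branch. All class names (MultiscalePoolingFusion, CrossModalSA), pooling scales, and tensor shapes are illustrative assumptions in a generic PyTorch setting; the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiscalePoolingFusion(nn.Module):
    """Parameter-free fusion: pool token maps of both modalities at several
    scales and concatenate them into a shared key/value token sequence."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales

    def forward(self, x_a, x_b, hw):
        # x_a, x_b: (B, N, C) token sequences from two modalities; hw = (H, W)
        h, w = hw
        pooled = []
        for x in (x_a, x_b):
            fmap = x.transpose(1, 2).reshape(x.size(0), x.size(2), h, w)
            for s in self.scales:
                p = F.adaptive_avg_pool2d(fmap, s)           # (B, C, s, s)
                pooled.append(p.flatten(2).transpose(1, 2))  # (B, s*s, C)
        return torch.cat(pooled, dim=1)                      # shared K/V tokens


class CrossModalSA(nn.Module):
    """Attention layer of one modal branch: queries come from that branch,
    keys/values from the fused multimodal tokens."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, kv):
        out, _ = self.attn(query=x, key=kv, value=kv)
        return x + out  # residual connection


# Usage example with toy tensors (hypothetical optical + elevation branches)
if __name__ == "__main__":
    B, H, W, C = 2, 16, 16, 64
    x_rgb = torch.randn(B, H * W, C)
    x_dsm = torch.randn(B, H * W, C)
    kv = MultiscalePoolingFusion()(x_rgb, x_dsm, (H, W))
    y_rgb = CrossModalSA(C)(x_rgb, kv)
    print(y_rgb.shape)  # torch.Size([2, 256, 64])
```

Because the pooling stage has no trainable parameters, the cross-modal key-value pairs add no weights of their own; in this sketch the only learnable components are the per-branch attention layers, which is consistent with the lightweight design emphasized in the abstract.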
Keywords