International Journal of Digital Earth (Dec 2023)

MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images

  • Haiyan Huang,
  • Zhenfeng Shao,
  • Qimin Cheng,
  • Xiao Huang,
  • Xiaoping Wu,
  • Guoming Li,
  • Li Tan

DOI
https://doi.org/10.1080/17538947.2023.2283482
Journal volume & issue
Vol. 16, no. 2
pp. 4848 – 4866

Abstract

Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods struggle with the intricate multi-scale nature and complex backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities between modalities. In response, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). The network comprises an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net can extract informative multi-scale features, facilitated by a multilayer perceptron and a transformer. We also introduce an adaptive gating mechanism in the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly available cross-modal datasets demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporary RSIC methods.
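
The abstract does not give the form of the adaptive gating mechanism, so the following is only a minimal sketch of one common way such a gate can be written, not the authors' implementation. It assumes an element-wise sigmoid gate that blends a visual context vector with a text state; the function name `adaptive_gate`, the per-element weights `w_visual` and `w_text`, and `bias` are all hypothetical and chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(visual, text, w_visual, w_text, bias):
    """Blend a visual context vector with a text state, element by element.

    Hypothetical sketch: each gate value g lies in (0, 1) and decides how
    much of the visual feature versus the text feature is kept. Per-element
    weights are a simplification; a learned gate would typically use full
    weight matrices.
    """
    fused = []
    for v, t, wv, wt, b in zip(visual, text, w_visual, w_text, bias):
        g = sigmoid(wv * v + wt * t + b)      # gate computed from both inputs
        fused.append(g * v + (1.0 - g) * t)   # convex combination of the two
    return fused


if __name__ == "__main__":
    visual = [1.0, 3.0]
    text = [3.0, 1.0]
    zeros = [0.0, 0.0]
    # With zero weights and zero bias the gate is 0.5, so the result is the mean.
    print(adaptive_gate(visual, text, zeros, zeros, zeros))
```

In this sketch a large positive bias pushes the gate towards the visual feature and a large negative bias towards the text feature, which is the sense in which such a gate is "adaptive" once its parameters are learned.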

Keywords