IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

MFINet: A Novel Zero-Shot Remote Sensing Scene Classification Network Based on Multimodal Feature Interaction

  • Xiaomeng Tan,
  • Bobo Xi,
  • Haitao Xu,
  • Yunsong Li,
  • Changbin Xue,
  • Jocelyn Chanussot

DOI
https://doi.org/10.1109/JSTARS.2024.3414499
Journal volume & issue
Vol. 17
pp. 11670 – 11684

Abstract

Zero-shot classification models aim to recognize image categories that are absent from the training phase by learning seen scenes together with semantic information. This capability is particularly useful in remote sensing (RS), where previously unseen classes frequently arise. However, most zero-shot RS scene classification approaches focus on matching visual and semantic features while overlooking visual feature extraction itself, especially the joint modeling of local and global information. Furthermore, because visual and semantic features are analyzed separately, the relationships between them have not been thoroughly investigated. To address these issues, we propose a novel zero-shot RS scene classification network based on multimodal feature interaction (MFINet). Specifically, MFINet deploys a hybrid image feature extraction network that combines convolutional neural networks with an improved Transformer to capture local discriminant information and long-range contextual information, respectively. Notably, we design a cross-modal feature fusion module that enhances relevant information in both the visual and semantic domains. Extensive experiments on a public zero-shot RS scene dataset consistently demonstrate that the proposed MFINet outperforms state-of-the-art methods across various seen/unseen category ratios.
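The abstract describes two components: a hybrid CNN and Transformer visual encoder, and a cross-modal fusion module that relates visual features to class-level semantic embeddings. The sketch below is a minimal, hypothetical PyTorch interpretation of that pipeline; the module names, dimensions, attention-based fusion, and the similarity-score readout are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the abstract's pipeline. All module names, dimensions,
# and the fusion mechanism are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class HybridVisualEncoder(nn.Module):
    """CNN branch for local features + Transformer branch for long-range context."""

    def __init__(self, dim=256):
        super().__init__()
        # Local discriminant information via a small convolutional stack.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # Long-range contextual information via self-attention over patch tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, x):
        feat = self.cnn(x)                        # (B, dim, 8, 8)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, 64, dim) patch tokens
        local = tokens.mean(dim=1)                # pooled local CNN features
        globl = self.transformer(tokens).mean(dim=1)
        return self.proj(torch.cat([local, globl], dim=-1))


class CrossModalFusion(nn.Module):
    """Attend class semantic embeddings with the visual feature as the query."""

    def __init__(self, dim=256, sem_dim=300):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, visual, semantic):
        # visual: (B, dim); semantic: (C, sem_dim), one embedding per class.
        sem = self.sem_proj(semantic).unsqueeze(0).expand(visual.size(0), -1, -1)
        q = visual.unsqueeze(1)                   # visual feature as the query
        fused, _ = self.attn(q, sem, sem)         # (B, 1, dim)
        # Compatibility score between the fused feature and each class embedding.
        return torch.einsum("bd,bcd->bc", fused.squeeze(1), sem)


if __name__ == "__main__":
    images = torch.randn(4, 3, 64, 64)       # toy RS scene batch
    class_semantics = torch.randn(10, 300)    # e.g., word-vector class prototypes
    encoder, fusion = HybridVisualEncoder(), CrossModalFusion()
    scores = fusion(encoder(images), class_semantics)
    print(scores.shape)                       # torch.Size([4, 10])
```

At inference time, a zero-shot prediction would pick the unseen class whose semantic embedding yields the highest compatibility score; the exact scoring and training objective used in the paper are not specified in this abstract.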

Keywords