IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2021)

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

  • Qimin Cheng,
  • Yuzhuo Zhou,
  • Peng Fu,
  • Yuan Xu,
  • Liang Zhang

DOI
https://doi.org/10.1109/JSTARS.2021.3070872
Journal volume & issue
Vol. 14
pp. 4284–4297

Abstract

Because of the rapid growth of multimodal data from the internet and social media, cross-modal retrieval has become an important and valuable task in recent years. The purpose of cross-modal retrieval is to obtain result data in one modality (e.g., image) that are semantically similar to the query data in another modality (e.g., text). In the field of remote sensing, despite a great number of existing works on image retrieval, there has been only a small amount of research on cross-modal image-text retrieval, owing to the scarcity of datasets and the complicated characteristics of remote sensing image data. In this article, we introduce a novel cross-modal image-text retrieval network that establishes a direct relationship between remote sensing images and their paired text data. Specifically, our framework includes a semantic alignment module that fully explores the latent correspondence between images and text, using attention and gate mechanisms to filter and optimize data features so that more discriminative feature representations can be obtained. Experimental results on four benchmark remote sensing datasets, including UCMerced-LandUse-Captions, Sydney-Captions, RSICD, and NWPU-RESISC45-Captions, show that our proposed method outperforms other baselines and achieves state-of-the-art performance on remote sensing image-text retrieval tasks.
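To make the abstract's description concrete, the sketch below illustrates one plausible way a semantic alignment module could combine cross-modal attention with a gating mechanism. It is not the authors' implementation; the module name, dimensions, and fusion scheme are assumptions for illustration only, under the common convention of text tokens attending over image region features.

    # A minimal, illustrative sketch (PyTorch) of gated cross-modal attention.
    # All names and dimensions here are hypothetical, not from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedCrossModalAlignment(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.query = nn.Linear(dim, dim)     # projects text tokens to queries
            self.key = nn.Linear(dim, dim)       # projects image regions to keys
            self.value = nn.Linear(dim, dim)     # projects image regions to values
            self.gate = nn.Linear(2 * dim, dim)  # decides what to keep vs. suppress
            self.scale = dim ** -0.5

        def forward(self, text_feats, image_feats):
            # text_feats: (batch, n_words, dim); image_feats: (batch, n_regions, dim)
            q = self.query(text_feats)
            k = self.key(image_feats)
            v = self.value(image_feats)
            # Cross-modal attention: each word attends over all image regions.
            attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            attended = attn @ v  # (batch, n_words, dim)
            # Gate filters the attended visual features conditioned on the text,
            # passing through only the more discriminative components.
            g = torch.sigmoid(self.gate(torch.cat([text_feats, attended], dim=-1)))
            return g * attended + (1 - g) * text_feats

    # Example usage with random features:
    # m = GatedCrossModalAlignment(dim=512)
    # out = m(torch.randn(2, 20, 512), torch.randn(2, 36, 512))  # -> (2, 20, 512)

Retrieval would then rank image-text pairs by a similarity score (e.g., cosine similarity) between pooled aligned features; that scoring step is likewise an assumption here.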

Keywords