Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval

Mohamad M. Al Rahhal; Yakoub Bazi; Norah A. Alsharif; Laila Bashmal; Naif Alajlan; Farid Melgani

doi:10.1109/JSTARS.2022.3215803

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2022)

Affiliations

Mohamad M. Al Rahhal: Applied Computer Science Department, College of Applied Computer Science, King Saud University, Riyadh, Saudi Arabia
Yakoub Bazi: Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Norah A. Alsharif: Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Laila Bashmal: Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Naif Alajlan: Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Farid Melgani: Department of Information Engineering and Computer Science, University of Trento, Trento, Italy

DOI: https://doi.org/10.1109/JSTARS.2022.3215803
Journal volume & issue: Vol. 15, pp. 9115–9126

Abstract

Cross-modal text-image retrieval in remote sensing (RS) provides a flexible retrieval experience for mining useful information from RS repositories. However, existing methods accept queries formulated in English only, which may restrict access to useful information for non-English speakers. Allowing multilanguage queries can improve communication with the retrieval system and broaden access to RS information. To address this limitation, this article proposes a multilanguage framework based on transformers. Specifically, the framework is composed of two transformer encoders that learn modality-specific representations: a language encoder that generates language representation features from the textual description, and a vision encoder that extracts visual features from the corresponding image. The two encoders are trained jointly on image-text pairs by minimizing a bidirectional contrastive loss. To enable the model to understand queries in multiple languages, we trained it on descriptions in four languages, namely English, Arabic, French, and Italian. The experimental results on three benchmark datasets (RSITMD, RSICD, and UCM) demonstrate that the proposed model significantly improves retrieval performance in terms of recall compared with existing state-of-the-art RS retrieval methods.
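
As an illustration of the training objective and the evaluation metric mentioned in the abstract, the following is a minimal sketch of one common form of bidirectional contrastive loss (a symmetric cross-entropy over image-to-text and text-to-image similarities) together with a text-to-image Recall@K computation. The abstract does not give the exact formulation, so the function names, the temperature value of 0.07, the cosine-similarity scoring, and the one-caption-per-image pairing are all assumptions made for this example, not the authors' implementation. The embeddings stand in for the outputs of the vision and language encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(rows, cols):
    """sim[i][j] = cosine similarity between rows[i] and cols[j]."""
    rows = [l2_normalize(r) for r in rows]
    cols = [l2_normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses.

    image_emb[i] and text_emb[i] are assumed to form a matching pair.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    sim = similarity_matrix(image_emb, text_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


def recall_at_k(text_emb, image_emb, k):
    """Fraction of text queries whose paired image ranks in the top k.

    Assumes exactly one ground-truth image per query (query i -> image i).
    """
    sim = similarity_matrix(text_emb, image_emb)
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    # Toy embeddings: aligned pairs should give a lower loss than swapped pairs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(images, aligned_texts))
    print(bidirectional_contrastive_loss(images, swapped_texts))
    print(recall_at_k(aligned_texts, images, k=1))
```

In this sketch, a query in any of the four languages would be scored the same way, since only the text embedding changes. The benchmark datasets provide several captions per image, so a full evaluation would need a mapping from each caption to its ground-truth image rather than the index pairing assumed here.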

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
