Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval

Mohamad M. Al Rahhal; Mohamed Abdelkader Bencherif; Yakoub Bazi; Abdullah Alharbi; Mohamed Lamine Mekhalfi

doi:10.3390/app13010282

Applied Sciences (Dec 2022)

Affiliations

Mohamad M. Al Rahhal: Applied Computer Science Department, College of Applied Computer Science, King Saud University, Riyadh 11543, Saudi Arabia
Mohamed Abdelkader Bencherif: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Yakoub Bazi: Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Abdullah Alharbi: Department of Computer Science, Community College, King Saud University, Riyadh 11437, Saudi Arabia
Mohamed Lamine Mekhalfi: Digital Industry Center, Technologies of Vision Unit, Fondazione Bruno Kessler, 38123 Trento, Italy

DOI: https://doi.org/10.3390/app13010282
Journal volume & issue: Vol. 13, no. 1, p. 282

Abstract

Remote sensing technology has advanced rapidly in recent years. Because of the deployment of quantitative and qualitative sensors, as well as the evolution of powerful hardware and software platforms, it powers a wide range of civilian and military applications. This in turn leads to the availability of large data volumes suitable for a broad range of applications, such as monitoring climate change. Yet processing, retrieving, and mining large volumes of data remain challenging. Usually, content-based remote sensing (RS) image retrieval approaches rely on a query image to retrieve relevant images from the dataset. To increase the flexibility of the retrieval experience, cross-modal representations based on text–image pairs are gaining popularity. Indeed, combining the text and image domains is regarded as one of the next frontiers in RS image retrieval. Yet aligning text to the content of RS images is particularly challenging due to the visual-semantic discrepancy between the language and vision worlds. In this work, we propose different architectures based on vision and language transformers for text-to-image and image-to-text retrieval. Extensive experimental results on four different datasets, namely TextRS, Merced, Sydney, and RSICD, are reported and discussed.
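
To make the two retrieval directions named in the abstract concrete, the following is a minimal sketch of cross-modal ranking over a shared embedding space. It is not the authors' implementation: the abstract does not say how similarity is computed, so cosine similarity is an assumption here, and the function names and the three-dimensional toy embeddings are hypothetical stand-ins for the outputs of a vision transformer and a language transformer.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical embeddings; in practice these would come from the two encoders.
image_embeddings = [
    [0.9, 0.1, 0.0],  # image 0
    [0.1, 0.9, 0.1],  # image 1
    [0.0, 0.2, 0.9],  # image 2
]
text_embeddings = [
    [1.0, 0.0, 0.1],  # caption 0 (assumed to describe image 0)
    [0.0, 1.0, 0.0],  # caption 1 (assumed to describe image 1)
    [0.1, 0.1, 1.0],  # caption 2 (assumed to describe image 2)
]

# Text-to-image retrieval: a caption is the query, images are the candidates.
text_to_image = rank_candidates(text_embeddings[0], image_embeddings)

# Image-to-text retrieval: an image is the query, captions are the candidates.
image_to_text = rank_candidates(image_embeddings[2], text_embeddings)

print("text-to-image ranking:", text_to_image)
print("image-to-text ranking:", image_to_text)
```

The same ranking function serves both directions because, under the shared-space assumption, only the roles of query and candidate set are swapped.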

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
