TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images

Taghreed Abdullah; Yakoub Bazi; Mohamad M. Al Rahhal; Mohamed L. Mekhalfi; Lalitha Rangarajan; Mansour Zuair

doi:10.3390/rs12030405

Remote Sensing (Jan 2020)

Affiliations

Taghreed Abdullah: Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore 570006, India
Yakoub Bazi: Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Mohamad M. Al Rahhal: Information System Department, College of Applied Computer Science, King Saud University, Riyadh 11543, Saudi Arabia
Mohamed L. Mekhalfi: Department of Information Engineering and Computer Science, University of Trento, Disi Via Sommarive 9, Povo, 28123 Trento, Italy
Lalitha Rangarajan: Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore 570006, India
Mansour Zuair: Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

DOI: https://doi.org/10.3390/rs12030405
Journal volume & issue: Vol. 12, no. 3, p. 405

Abstract

Exploring the relevance between images and their natural language descriptions is regarded, owing to its importance, as the next frontier in the general computer vision literature. Accordingly, several recent works have attempted to map visual attributes onto their corresponding textual content with some success. However, this line of research has not been widely pursued in the remote sensing community. In this respect, our contribution is threefold. First, we construct a new dataset for text-image matching tasks, termed TextRS, by collecting images from four well-known scene datasets, namely AID, Merced, PatternNet, and NWPU. Each image is annotated with five different sentences, provided by five different people to ensure diversity. Second, we put forth a novel Deep Bidirectional Triplet Network (DBTN) for text-to-image matching. Unlike traditional remote sensing image-to-image retrieval, our paradigm carries out retrieval by matching text representations to image representations. To achieve this, we propose to learn a bidirectional triplet network composed of a Long Short-Term Memory (LSTM) network and pre-trained Convolutional Neural Networks (CNNs): EfficientNet-B2, ResNet-50, Inception-v3, and VGG16. Third, we top the proposed architecture with an average fusion strategy that fuses the features of the five sentences describing each image, which enables learning a more robust embedding. Performance is reported in terms of Recall@K, the presence of the relevant image among the top K images retrieved for the query text. The method shows promising results, yielding 17.20%, 51.39%, and 73.02% for K = 1, 5, and 10, respectively.
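
The three ingredients named in the abstract (average fusion of sentence features, a bidirectional triplet objective, and Recall@K) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the distance measure, the margin, or how the two matching directions are combined, so the squared Euclidean distance, the margin of 0.5, the summed loss, and all function names below are assumptions made for illustration only.

```python
def average_fusion(sentence_embeddings):
    """Average several sentence vectors (e.g. five captions) into one text vector."""
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    return [sum(vec[d] for vec in sentence_embeddings) / n for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def bidirectional_triplet_loss(text, image, neg_text, neg_image, margin=0.5):
    """Sum of a text-anchored and an image-anchored triplet loss (assumed form)."""
    # Text as anchor: matching image is the positive, mismatched image the negative.
    text_to_image = triplet_loss(text, image, neg_image, margin)
    # Image as anchor: matching text is the positive, mismatched text the negative.
    image_to_text = triplet_loss(image, text, neg_text, margin)
    return text_to_image + image_to_text


def recall_at_k(text_vecs, image_vecs, k):
    """Fraction of text queries whose matching image (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(text_vecs):
        ranked = sorted(range(len(image_vecs)),
                        key=lambda j: sq_dist(query, image_vecs[j]))
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_vecs)
```

In this sketch, `recall_at_k` assumes that text query `i` has exactly one relevant image, stored at index `i`. A reported value such as 73.02% for K = 10 would then correspond to `recall_at_k(text_vecs, image_vecs, 10)` returning roughly 0.73 on the test queries.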

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
