Remote Sensing (Jun 2024)

An Enhanced Feature Extraction Framework for Cross-Modal Image–Text Retrieval

  • Jinzhi Zhang,
  • Luyao Wang,
  • Fuzhong Zheng,
  • Xu Wang,
  • Haisu Zhang

DOI: https://doi.org/10.3390/rs16122201
Journal volume & issue: Vol. 16, no. 12, p. 2201

Abstract


Remote sensing images generally depict intricate scenes. In cross-modal retrieval tasks involving such images, the accompanying text carries a wealth of information but emphasizes mainly large objects, which attract more attention, so features of small targets are often omitted. While the conventional vision transformer (ViT) adeptly captures information about large global targets, its capability to extract features of small targets is limited. This limitation stems from the receptive field of ViT's self-attention layer, in which interference from large targets hinders the extraction of information about small targets. To address this concern, this study introduces a patch classification framework based on feature similarity, which establishes distinct receptive fields in the feature space to mitigate the interference of large targets with small ones, thereby enhancing the ability of a conventional ViT to extract features from small targets. We conducted evaluation experiments on two popular datasets, the Remote Sensing Image–Text Match Dataset (RSITMD) and the Remote Sensing Image Captioning Dataset (RSICD), obtaining mR scores of 35.6% and 19.47%, respectively. The proposed approach improves the retrieval accuracy for small targets and can be applied to more complex image–text retrieval tasks involving multi-scale ground objects.
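The abstract describes the mechanism only at a high level. Below is a minimal sketch of one plausible reading of it: patch tokens are grouped by cosine feature similarity, and self-attention is masked to within-group tokens so that large-object patches do not interfere with small-object patches. The clustering routine, function names, and hyperparameters (e.g., num_groups) are illustrative assumptions, not the authors' published implementation.

```python
# Sketch: feature-similarity patch grouping + group-masked self-attention.
# Assumed interpretation of the abstract, not the paper's released code.
import torch
import torch.nn.functional as F

def group_patches_by_similarity(tokens: torch.Tensor, num_groups: int = 4,
                                iters: int = 10) -> torch.Tensor:
    """Cluster patch tokens (B, N, D) with cosine-similarity k-means.
    Returns hard group assignments of shape (B, N)."""
    x = F.normalize(tokens, dim=-1)
    B, N, D = x.shape
    # Initialize centroids from randomly chosen patches in each image.
    idx = torch.randint(0, N, (B, num_groups), device=x.device)
    centroids = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    for _ in range(iters):
        sim = x @ centroids.transpose(1, 2)              # (B, N, K) cosine sims
        assign = sim.argmax(dim=-1)                      # (B, N) hard assignment
        onehot = F.one_hot(assign, num_groups).float()   # (B, N, K)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)      # (B, K, 1)
        centroids = F.normalize(onehot.transpose(1, 2) @ x / counts, dim=-1)
    return assign

def grouped_self_attention(tokens: torch.Tensor,
                           groups: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention masked so each patch attends only to
    patches in its own similarity group (a receptive field defined in
    feature space rather than image space)."""
    B, N, D = tokens.shape
    scores = tokens @ tokens.transpose(1, 2) / D ** 0.5            # (B, N, N)
    same_group = groups.unsqueeze(2) == groups.unsqueeze(1)        # (B, N, N)
    scores = scores.masked_fill(~same_group, float('-inf'))
    return torch.softmax(scores, dim=-1) @ tokens

# Toy usage: 196 patch tokens (a 14x14 grid) with 768-dim ViT features.
tokens = torch.randn(2, 196, 768)
groups = group_patches_by_similarity(tokens, num_groups=4)
out = grouped_self_attention(tokens, groups)
print(out.shape)  # torch.Size([2, 196, 768])
```

Because the attention mask keeps each token's neighborhood within its own feature cluster, small-target patches are compared mainly against similar patches instead of being swamped by the many large-object patches in the softmax, which is the effect the abstract attributes to the distinct receptive fields.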

Keywords