Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Zhengxin Li; Wenzhe Zhao; Xuanyi Du; Guangyao Zhou; Songlin Zhang

doi:10.3390/rs16010196

Remote Sensing (Jan 2024)

Affiliations

Zhengxin Li: The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Wenzhe Zhao: The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Xuanyi Du: The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Guangyao Zhou: The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Songlin Zhang: The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

DOI: https://doi.org/10.3390/rs16010196
Journal volume & issue: Vol. 16, no. 1, p. 196

Abstract

Two-stage remote sensing image captioning (RSIC) methods have achieved promising results by incorporating additional pre-trained remote sensing tasks to extract supplementary information and improve caption quality. However, these methods face limitations in semantic comprehension, as pre-trained detectors/classifiers are constrained by predefined labels, leading to an oversight of the intricate and diverse details present in remote sensing images (RSIs). Additionally, handling auxiliary remote sensing tasks separately can make it difficult to ensure seamless integration and alignment with the captioning process. To address these problems, we propose a novel cross-modal retrieval and semantic refinement (CRSR) RSIC method. Specifically, we employ a cross-modal retrieval model to retrieve relevant sentences for each image. The words in these retrieved sentences are then considered as primary semantic information, providing valuable supplementary information for the captioning process. To further enhance the quality of the captions, we introduce a semantic refinement module that refines the primary semantic information, which helps to filter out misleading information and emphasize visually salient semantic information. A Transformer Mapper network is introduced to expand the representation of image features beyond the retrieved supplementary information with learnable queries. Both the refined semantic tokens and visual features are integrated and fed into a cross-modal decoder for caption generation. Through extensive experiments, we demonstrate the superiority of our CRSR method over existing state-of-the-art approaches on the RSICD, the UCM-Captions, and the Sydney-Captions datasets.
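
The retrieval step described in the abstract (retrieve relevant sentences for an image, then treat their words as primary semantic information) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `retrieve_words`, the cosine-similarity ranking, the value of `k`, and the three-dimensional toy embeddings are all assumptions made for illustration. In practice the vectors would come from a trained cross-modal encoder, which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_words(image_vec, sentence_bank, k=2):
    """Return the top-k sentences by similarity and their deduplicated words.

    sentence_bank is a list of (sentence, embedding) pairs. The embeddings are
    hypothetical stand-ins for the output of a cross-modal retrieval model.
    """
    ranked = sorted(
        sentence_bank,
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    top = [sentence for sentence, _ in ranked[:k]]
    words = []
    for sentence in top:
        for word in sentence.lower().split():
            if word not in words:  # keep first occurrence, preserve order
                words.append(word)
    return top, words

# Toy data (assumed): one image embedding and three candidate sentences.
bank = [
    ("many planes are parked at the airport", [0.9, 0.1, 0.0]),
    ("a river runs through the forest", [0.0, 0.2, 0.9]),
    ("several planes near a terminal building", [0.8, 0.3, 0.1]),
]
image_vec = [1.0, 0.2, 0.0]
top, words = retrieve_words(image_vec, bank, k=2)
```

Here `words` plays the role of the primary semantic information. The semantic refinement module described in the abstract would then filter and reweight these words against the visual features, a step this sketch leaves out.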

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/