IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning
Abstract
Remote sensing image captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs), as it involves describing their content in clear and precise natural language. However, RSIC faces difficulties arising from the intricate structure and distinctive characteristics of RSIs, such as rotational ambiguity. Visually similar objects or regions can also be misidentified. In addition, prioritizing groups of objects with strong relational ties during caption generation poses a significant challenge. To address these challenges, we propose the visual rotated position encoding transformer for RSIC. First, rotation-invariant features and global features are extracted by a multilevel feature extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding module, which is incorporated into the transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a feature enhancement fusion module consisting of feature enhancement and feature fusion components. The feature enhancement component adopts a self-attention mechanism to construct fully connected graphs over object features, while the feature fusion component integrates global features and word vectors to guide the caption generation process. In addition, we construct an RSI rotated object detection dataset, RSIC-ROD, and pretrain a rotated object detector on it. The proposed method achieves significant performance improvements on four datasets, demonstrating enhanced ability to preserve descriptive details, distinguish similar objects, and accurately capture object relationships.
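The abstract does not specify how the visual rotated position encoding enters the transformer encoder; one plausible reading is that pairwise directional relations between rotated object boxes are mapped to an additive per-head attention bias. The sketch below illustrates that general idea only; the module name, the (sin, cos) encoding of relative direction, and the learned projection are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotatedPositionBias(nn.Module):
    """Hypothetical sketch: maps pairwise directional relations between
    rotated object boxes to an additive per-head attention bias."""

    def __init__(self, num_heads: int):
        super().__init__()
        # Assumed learned map from the (sin, cos) direction encoding to a per-head bias.
        self.proj = nn.Linear(2, num_heads)

    def forward(self, centers: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        # centers: (N, 2) box centers; angles: (N,) box orientations in radians.
        dx = centers[:, None, 0] - centers[None, :, 0]  # (N, N) pairwise x offsets
        dy = centers[:, None, 1] - centers[None, :, 1]  # (N, N) pairwise y offsets
        # Direction of object j as seen from object i's rotated frame.
        rel = torch.atan2(dy, dx) - angles[:, None]
        enc = torch.stack([rel.sin(), rel.cos()], dim=-1)  # smooth periodic encoding, (N, N, 2)
        return self.proj(enc).permute(2, 0, 1)             # (num_heads, N, N) attention bias


def attention_with_bias(q, k, v, bias):
    # q, k, v: (num_heads, N, d); bias: (num_heads, N, N) from RotatedPositionBias.
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5 + bias
    return F.softmax(logits, dim=-1) @ v
```

Encoding the relative direction with sin/cos keeps the bias periodic in angle, so two objects related by a full rotation receive the same directional bias, which is one way a rotation-aware encoding could be made consistent across orientations.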
Keywords