Remote Sensing (May 2024)

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

  • Jie Guo
  • Ze Li
  • Bin Song
  • Yuhao Chi

DOI
https://doi.org/10.3390/rs16111843
Journal volume & issue
Vol. 16, no. 11
p. 1843

Abstract

In remote sensing image captioning (RSIC), mainstream methods typically adopt an encoder–decoder framework. Methods based on this framework often use only simple feature fusion strategies and therefore fail to fully mine the fine-grained features of the remote sensing image. Moreover, because the decoder is not given context information, the generated sentences are less accurate. To address these problems, we propose a two-stage feature enhancement model (TSFE) for remote sensing image captioning. In the first stage, we adopt an adaptive feature fusion strategy to acquire multi-scale features. In the second stage, we further mine fine-grained features from the multi-scale features by establishing associations between different regions of the image. In addition, we introduce global features carrying scene information into the decoder to help generate descriptions. Experimental results on the RSICD, UCM-Captions, and Sydney-Captions datasets demonstrate that the proposed method outperforms existing state-of-the-art approaches.
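
The two stages described above can be illustrated with a minimal NumPy sketch. This is not the TSFE implementation: the abstract does not specify the fusion or association mechanisms, so the sketch assumes a softmax gate over scales for the adaptive fusion and plain scaled dot-product self-attention for the region associations. All names (adaptive_fusion, region_attention, gate_w) and all shapes are hypothetical, and the inputs are random placeholders rather than real image features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(scales, gate_w):
    """Stage 1 (assumed form): input-dependent weighted sum over scales.

    scales: list of S arrays of shape (N, D), already resized to N regions.
    gate_w: (D,) hypothetical gate vector standing in for learned weights.
    """
    stacked = np.stack(scales)                      # (S, N, D)
    pooled = stacked.mean(axis=1)                   # (S, D) summary per scale
    weights = softmax(pooled @ gate_w, axis=0)      # (S,) one weight per scale
    fused = np.tensordot(weights, stacked, axes=1)  # (N, D) weighted sum
    return fused, weights

def region_attention(x):
    """Stage 2 (assumed form): self-attention relating every region to
    every other region, added back to the input as a residual."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (N, N) region affinities
    attn = softmax(scores, axis=-1)                 # each row sums to 1
    return x + attn @ x, attn

rng = np.random.default_rng(0)
scales = [rng.standard_normal((49, 64)) for _ in range(3)]  # 3 scales, 7x7 grid
gate_w = rng.standard_normal(64)

fused, weights = adaptive_fusion(scales, gate_w)
enhanced, attn = region_attention(fused)

# Assumed stand-in for the global scene feature passed to the decoder:
# the mean of the enhanced region features.
global_feat = enhanced.mean(axis=0)                 # (D,)
```

In a trained model the gate and attention projections would be learned parameters; here they are fixed random values so that only the data flow is shown.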

Keywords