IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2023)

From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning

  • Runyan Du,
  • Wei Cao,
  • Wenkai Zhang,
  • Guo Zhi,
  • Xian Sun,
  • Shuoke Li,
  • Jihao Li

DOI
https://doi.org/10.1109/JSTARS.2023.3305889
Journal volume & issue
Vol. 16
pp. 7704 – 7717

Abstract

With the growth of remote sensing imagery, automatically understanding image content has attracted the interest of many researchers working on deep learning for remote sensing images. Inspired by natural image captioning, models that use a convolutional neural network (CNN) and recurrent neural network (RNN) backbone supplemented by attention have been widely used in remote sensing image captioning. However, current attention layers are inefficient at simultaneously mining the hidden foreground from the background of a remote sensing image and performing interactive feature learning. Meanwhile, newer mainstream language models have recently surpassed the traditional long short-term memory (LSTM) network in sentence generation. To address these problems, in this article we propose a novel idea: making flat remote sensing images stereoscopic by separating the foreground from the background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multiscale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.

Keywords