Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model

Yue Yang; Tie Liu; Ying Pu; Liangchen Liu; Qijun Zhao; Qun Wan

doi:10.3390/rs16214083

Remote Sensing (Nov 2024)

Affiliations

Yue Yang: College of Computer Science, Sichuan University, Chengdu 610025, China
Tie Liu: College of Computer Science, Sichuan University, Chengdu 610025, China
Ying Pu: College of Computer Science, Sichuan University, Chengdu 610025, China
Liangchen Liu: College of Computer Science, Sichuan University, Chengdu 610025, China
Qijun Zhao: College of Computer Science, Sichuan University, Chengdu 610025, China
Qun Wan: School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

DOI: https://doi.org/10.3390/rs16214083
Journal volume & issue: Vol. 16, no. 21, p. 4083

Abstract

Remote sensing image change captioning (RSICC) has received considerable research interest for its ability to automatically generate meaningful sentences describing the changes between remote sensing (RS) images. Existing RSICC methods mainly rely on networks pre-trained on natural image datasets to extract feature representations. This degrades performance, since aerial images have distinctive characteristics compared to natural images. In addition, it is challenging for these methods to capture the data distribution and perceive contextual information between samples, which limits the robustness and generalization of the feature representations. Furthermore, because they directly aggregate all features, they focus insufficiently on the most change-aware discriminative information. To deal with these problems, a novel framework entitled Multi-Attentive network with Diffusion model for RSICC (MADiffCC) is proposed in this work. Specifically, we introduce a diffusion feature extractor, based on a diffusion model pre-trained on an RS image dataset, to capture multi-level and multi-time-step feature representations of bitemporal RS images. The diffusion model is able to learn the training data distribution and the contextual information of RS objects, from which more robust and generalized representations can be extracted for the downstream application of change captioning. Furthermore, a difference encoder based on a time-channel-spatial attention (TCSA) mechanism is designed to use the extracted diffusion features to obtain the discriminative information. A gated multi-head cross-attention (GMCA)-guided change captioning decoder is then proposed to select and fuse crucial hierarchical features for more precise change description generation. Experimental results on the publicly available LEVIR-CC, LEVIRCCD, and DUBAI-CC datasets verify that the developed approach achieves state-of-the-art (SOTA) performance.
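
To make the idea of gated multi-head cross-attention more concrete, the following is a minimal NumPy sketch of one common formulation: caption-token queries attend over visual features, and a learned sigmoid gate decides how much of the attended signal is added back to each token. The abstract does not specify the GMCA design, so the gate form, the residual connection, the function names (`gated_cross_attention`, `init_params`), and all shapes are assumptions made for illustration and may differ from the paper's actual module.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical parameter set: four projections plus one gate matrix.
    scale = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(0.0, scale, (d, d)),
        "Wk": rng.normal(0.0, scale, (d, d)),
        "Wv": rng.normal(0.0, scale, (d, d)),
        "Wo": rng.normal(0.0, scale, (d, d)),
        "Wg": rng.normal(0.0, scale, (2 * d, d)),
    }


def gated_cross_attention(queries, memory, params, num_heads):
    """Assumed formulation, not the paper's confirmed design.

    queries: (Lq, d) caption-token embeddings.
    memory:  (Lm, d) visual difference features.
    Returns the gated output (Lq, d) and attention weights (heads, Lq, Lm).
    """
    Lq, d = queries.shape
    Lm = memory.shape[0]
    if d % num_heads != 0:
        raise ValueError("d must be divisible by num_heads")
    dh = d // num_heads

    def split_heads(x, length):
        # (length, d) -> (heads, length, dh)
        return x.reshape(length, num_heads, dh).transpose(1, 0, 2)

    q = split_heads(queries @ params["Wq"], Lq)
    k = split_heads(memory @ params["Wk"], Lm)
    v = split_heads(memory @ params["Wv"], Lm)

    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, Lq, dh)

    # Merge heads back to (Lq, d) and apply the output projection.
    attended = heads.transpose(1, 0, 2).reshape(Lq, d) @ params["Wo"]

    # Element-wise gate in (0, 1), computed from the token and what it attended to.
    gate = sigmoid(np.concatenate([queries, attended], axis=-1) @ params["Wg"])
    return queries + gate * attended, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(8, rng)
    tokens = rng.normal(size=(3, 8))    # 3 caption tokens
    features = rng.normal(size=(5, 8))  # 5 visual feature vectors
    out, attn = gated_cross_attention(tokens, features, params, num_heads=2)
    print(out.shape, attn.shape)
```

In this sketch a gate value near zero leaves a token almost unchanged, while a value near one passes the attended visual information through, which is one plausible way a decoder could "select and fuse" features from several levels.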

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
