A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Dongwei Sun; Yajie Bao; Junmin Liu; Xiangyong Cao

doi:10.1109/JSTARS.2024.3471625

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

Affiliations

Dongwei Sun: School of Computer Science and Technology and the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, China
Yajie Bao: School of Computer Science and Technology and the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, China
Junmin Liu: Department of Information Science, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
Xiangyong Cao: School of Computer Science and Technology and the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, China

DOI: https://doi.org/10.1109/JSTARS.2024.3471625
Journal volume & issue: Vol. 17
pp. 18727 – 18738

Abstract

Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences between bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods face challenges, e.g., a large number of parameters and high computational complexity caused by the self-attention operation in the transformer encoder component. To alleviate these issues, this article proposes a sparse focus transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components, i.e., a high-level feature extractor based on a convolutional neural network, a transformer encoder network based on a sparse focus attention mechanism and designed to locate and capture changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences captioning the differences. The proposed SFT network reduces the number of parameters and the computational complexity by incorporating a sparse attention mechanism within the transformer encoder network. Experimental results on various datasets demonstrate that even with a reduction of over 90% in parameters and computational complexity for the transformer encoder, the proposed network still obtains competitive performance compared to other state-of-the-art RSICC methods.
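
As a rough illustration of why sparse attention lowers computational cost, the sketch below counts the query-key pairs scored by full self-attention and by a local-window pattern on a grid of feature tokens. The window pattern, the 16 x 16 grid size, the radius, and the function names are all assumptions made for illustration; the abstract does not specify how the sparse focus attention mechanism selects positions, and the resulting numbers are not the paper's results. The sketch also covers only the attention cost, not the parameter count.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


def windowed_pairs(height: int, width: int, radius: int) -> int:
    """Query-key pairs when each token on a height x width grid attends only
    to tokens within `radius` rows and columns of itself (clipped at edges).

    This local window is an assumed sparsity pattern, not the paper's.
    """
    total = 0
    for r in range(height):
        # Number of grid rows that fall inside the window around row r.
        rows = min(height - 1, r + radius) - max(0, r - radius) + 1
        for c in range(width):
            # Number of grid columns that fall inside the window around column c.
            cols = min(width - 1, c + radius) - max(0, c - radius) + 1
            total += rows * cols
    return total


if __name__ == "__main__":
    h, w, radius = 16, 16, 1  # hypothetical feature-map size and window radius
    dense = dense_pairs(h * w)
    sparse = windowed_pairs(h, w, radius)
    print(f"dense pairs:   {dense}")
    print(f"sparse pairs:  {sparse}")
    print(f"fraction kept: {sparse / dense:.3f}")
```

When the radius is large enough to cover the whole grid, the windowed count equals the dense count, so the saving in this sketch comes entirely from restricting which positions each token attends to.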

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
