IEEE Access (Jan 2025)
Enhanced CLIP-GPT Framework for Cross-Lingual Remote Sensing Image Captioning
Abstract
Remote Sensing Image Captioning (RSIC) aims to automatically generate precise and informative descriptions of remote sensing images. Traditional encoder-decoder approaches are limited by high training costs and a heavy reliance on large-scale annotated datasets, which hinders their practical application. To address these challenges, we propose a lightweight solution based on an enhanced CLIP-GPT framework. Our approach uses CLIP for zero-shot multimodal feature extraction from remote sensing images, and then designs and optimizes a mapping network, built on an improved Transformer with adaptive multi-head attention, that aligns these features with the text space of GPT-2 to generate high-quality descriptive text. Experimental results on the Sydney-captions, UCM-captions, and RSICD datasets show that the proposed mapping network exploits CLIP-extracted multimodal features more effectively than existing methods, enabling the GPT language model to generate more accurate and stylistically appropriate text. Furthermore, our method achieves performance comparable or superior to traditional encoder-decoder baselines on the BLEU, CIDEr, and METEOR metrics while requiring only one-fifth of the training time. Experiments on an additional Chinese-English bilingual RSIC dataset further underscore the advantages of the CLIP-GPT framework, whose extensive multimodal pre-training gives it strong potential for cross-lingual RSIC tasks.
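To make the pipeline concrete, the sketch below illustrates one plausible form of the mapping network described above: a small Transformer that converts a CLIP image embedding into a sequence of prefix embeddings for GPT-2. It is an illustrative assumption, not the paper's implementation; in particular, the paper's adaptive multi-head attention is not specified here, so standard multi-head attention stands in for it, and the dimensions (512 for a CLIP ViT-B/32 embedding, 768 for GPT-2) and prefix length are assumed values.

```python
# Minimal sketch of a CLIP-to-GPT-2 prefix mapping network (assumed design).
import torch
import torch.nn as nn

class PrefixMappingNetwork(nn.Module):
    """Maps a CLIP image embedding to a sequence of GPT-2 prefix embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        # Project the single CLIP vector into `prefix_len` token slots.
        self.input_proj = nn.Linear(clip_dim, gpt_dim * prefix_len)
        # Learned queries that the Transformer refines into the final prefix.
        self.prefix_queries = nn.Parameter(torch.randn(prefix_len, gpt_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=gpt_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, clip_embedding):          # (B, clip_dim)
        b = clip_embedding.size(0)
        proj = self.input_proj(clip_embedding).view(b, self.prefix_len, -1)
        queries = self.prefix_queries.unsqueeze(0).expand(b, -1, -1)
        # Image-derived tokens and learned queries attend to each other;
        # only the refined query positions are returned as the GPT-2 prefix.
        x = torch.cat([proj, queries], dim=1)   # (B, 2*prefix_len, gpt_dim)
        x = self.transformer(x)
        return x[:, self.prefix_len:]           # (B, prefix_len, gpt_dim)

# Usage: the returned prefix would be concatenated with GPT-2's token
# embeddings (inputs_embeds) so the language model decodes the caption.
mapper = PrefixMappingNetwork()
prefix = mapper(torch.randn(2, 512))            # -> torch.Size([2, 10, 768])
```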
Keywords