Remote Sensing (Aug 2024)
VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning
Abstract
Pioneering remote sensing image captioning (RSIC) works generate fluent and coherent sentences via autoregressive decoding but suffer from high inference latency and computational cost. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, though at the cost of performance due to the lack of sequential dependencies. Recently, diffusion model-based non-autoregressive decoding has shown promise in natural image captioning through iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images (RSIs), which complicate robust input construction and degrade description accuracy. To overcome these challenges, we propose an innovative diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module that extracts discernible visual context features of RSIs as input to the diffusion model-based non-autoregressive decoder, conditionally controlling the multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) that exploits dual image–description interactions to generate descriptions finely aligned with the image content. Experiments conducted on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines on classic metrics, while achieving speedups of around 8.22× on Sydney-Captions, 11.61× on UCM-Captions, 15.20× on RSICD, and 8.13× on NWPU-Captions.
Keywords