International Journal of Digital Earth (Dec 2024)
FRIC: a framework for few-shot remote sensing image captioning
Abstract
Training image captioning (IC) models requires a large number of caption-labeled samples, a requirement that is usually difficult to satisfy in real remote sensing scenarios, and model performance degrades under the resulting few-shot conditions. We describe the few-shot problem in remote sensing image captioning (RC) and design two research schemes. We then propose FRIC, a framework for few-shot remote sensing image captioning. FRIC requires no additional samples and uses a simple base model. It seeks performance gains from split samples while reducing the negative effects of noise. Unlike previous works that use 100% of the samples to simulate few-shot scenarios, FRIC uses less than 1.0% of the data to simulate actual few-shot scenarios. Whereas previous works focus on improving the encoder, FRIC focuses on optimizing the decoder with parameter ensemble, multi-model ensemble, and self-distillation. FRIC can train a simple base model with limited caption-labeled samples to generate captions that meet human expectations. FRIC shows clear advantages over other methods when trained with only 0.8% of the samples in RC datasets. No previous work has used such a small amount of data to train an RC model. In addition, the effectiveness of the components in FRIC is verified with ablation experiments.
Keywords