Remote Sensing (Sep 2022)
Self-Learning for Few-Shot Remote Sensing Image Captioning
Abstract
Large-scale caption-labeled remote sensing image samples are expensive to acquire, and the training samples available in practical applications are generally limited. Remote sensing image captioning therefore inevitably becomes a few-shot problem, which degrades the quality of the generated descriptions. In this study, we propose a self-learning method named SFRC for few-shot remote sensing image captioning. Without relying on additional labeled samples or external knowledge, SFRC improves few-shot performance by improving how, and how efficiently, the model learns from limited data. We first train an encoder for semantic feature extraction with a modified BYOL self-supervised method on a small set of unlabeled remote sensing images, where the unlabeled samples are derived from the caption-labeled ones. When training the captioning model on a small number of caption-labeled samples, self-ensembling produces a parameter-averaged teacher model by integrating intermediate states of the model over a training time horizon. Self-distillation then uses this teacher to generate pseudo labels that guide the next-generation student model toward better performance. Additionally, when optimizing the model by back-propagation, we design a baseline incorporating self-critical self-ensembling to reduce the variance of gradient estimates and mitigate overfitting. In experiments using only limited caption-labeled samples, SFRC outperforms recent methods on standard evaluation metrics. We also conduct percentage-sampling few-shot experiments to test SFRC with even fewer samples, as well as ablation experiments on its key designs. The ablation results show that each of the self-learning designs for captioning in sparse remote sensing sample scenarios contributes to the performance of SFRC.
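The self-ensemble teacher described above is a parameter average of intermediate model states over a training horizon. One common way such averaging is realized in practice is an exponential moving average (EMA) of student weights; the sketch below illustrates that idea only, and the function name `ema_update`, the decay value, and the scalar "parameters" are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a parameter-averaging ("self-ensemble") teacher.
# The abstract describes integrating intermediate model states over a
# training horizon; an exponential moving average (EMA) of student weights
# is one common realization of that idea. All names and the decay value
# here are illustrative assumptions.

def ema_update(teacher, student, decay=0.999):
    """Blend student parameters into the teacher: t <- d*t + (1-d)*s."""
    for name, s_param in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_param
    return teacher

# Toy example with scalar "parameters" standing in for weight tensors.
student = {"w": 1.0}
teacher = {"w": 0.0}
for _ in range(5):  # five training steps with a fixed student
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher["w"])  # -> 0.96875, the teacher drifts toward the student
```

In a full training loop, the teacher produced this way would then generate the pseudo labels used for self-distillation, while the student continues to receive gradient updates.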
Keywords