IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval
Abstract
Remote sensing (RS) image-text retrieval is a practical and challenging task that has received considerable attention. Currently, most approaches rely on either convolutional neural networks or Transformers, which cannot effectively extract global and fine-grained features simultaneously. Furthermore, the high intra-modal similarity typical of the RS domain poses a challenge for feature learning. In addition, most studies neglect how the characteristics of model training differ across stages. To tackle these problems, we propose a fine-grained information supplementation (FGIS) and value-guided learning model that leverages prior knowledge in the RS domain for feature supplementation and employs a value-guided training approach to learn fine-grained, expressive, and robust feature representations. Specifically, we introduce the FGIS module to supplement fine-grained visual features, thereby enhancing the model's perception of both global and local features. Furthermore, we mitigate the problem of high intra-modal similarity by proposing two loss functions: a weighted contrastive loss and a scene-adaptive fine-grained perceptual loss. Finally, we design a value-guided learning framework that focuses on the most important information at each stage of training. Extensive experiments on the Remote Sensing Image Captioning Dataset (RSICD) and the Remote Sensing Image-Text Match Dataset (RSITMD) verify the effectiveness and superiority of our model.
Keywords