IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2023)
Object-Centric Masked Image Modeling-Based Self-Supervised Pretraining for Remote Sensing Object Detection
Abstract
Masked image modeling (MIM) has been shown to be an optimal pretext task for self-supervised pretraining (SSP): it helps the model capture an effective task-agnostic representation during pretraining, which in turn improves fine-tuning performance on various downstream tasks. However, because MIM masks a high ratio of the image at random, scene-level MIM-based SSP struggles to capture small-scale objects and local details in complex remote sensing scenes. As a result, when pretrained models that capture mostly scene-level information are applied directly to object-level fine-tuning, there is an obvious representation learning misalignment between the pretraining and fine-tuning steps. Therefore, in this article, a novel object-centric masked image modeling (OCMIM) strategy is proposed so that the model better captures object-level information during pretraining, which further improves the object detection fine-tuning step. First, to better learn object-level representations covering full scales and multiple categories in MIM-based SSP, a novel object-centric data generator is proposed to automatically set up targeted pretraining data according to the objects themselves, providing data conditions specific to object detection model pretraining. Second, an attention-guided mask generator is designed to generate a guided mask for the MIM pretext task, which leads the model to learn more discriminative representations of highly attended object regions than a random masking strategy does. Finally, experiments are conducted on six remote sensing object detection benchmarks, and the results show that the proposed OCMIM-based SSP strategy is a better pretraining approach for remote sensing object detection than commonly used methods.
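The contrast between random masking and attention-guided masking can be illustrated with a minimal sketch. The abstract does not state how OCMIM derives its guided mask from attention, so the top-k selection below is an assumption made purely for illustration, and the function names, the toy attention scores, and the 0.5 mask ratio are all hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Baseline: mask a uniformly random subset of patch indices."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def attention_guided_mask(attention, mask_ratio):
    """Hypothetical guided variant: mask the most-attended patches first.

    `attention` holds one non-negative score per patch. Top-k selection is
    an assumption of this sketch, not the method described in the article.
    """
    num_masked = int(len(attention) * mask_ratio)
    # Rank patch indices by descending attention; ties keep index order.
    ranked = sorted(range(len(attention)), key=lambda i: -attention[i])
    return sorted(ranked[:num_masked])


# Toy 2x4 patch grid flattened row by row; patches 2 and 3 cover an object.
attention = [0.05, 0.10, 0.90, 0.80, 0.05, 0.02, 0.04, 0.04]
guided = attention_guided_mask(attention, mask_ratio=0.5)
baseline = random_mask(len(attention), mask_ratio=0.5)
```

In this toy setting the guided mask always covers the two highly attended object patches, whereas the random baseline covers them only by chance, which is one way the reconstruction task could be concentrated on object regions.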
Keywords