Object-Centric Masked Image Modeling-Based Self-Supervised Pretraining for Remote Sensing Object Detection

Tong Zhang; Yin Zhuang; He Chen; Liang Chen; Guanqun Wang; Peng Gao; Hao Dong

doi:10.1109/JSTARS.2023.3277588

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2023)

Affiliations

Tong Zhang: Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
Yin Zhuang: Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
He Chen: Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
Liang Chen: Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
Guanqun Wang: Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
Peng Gao: Shanghai AI Laboratory, Shanghai, China
Hao Dong: Center on Frontiers of Computing Studies (CFCS), Peking University, Beijing, China

DOI: https://doi.org/10.1109/JSTARS.2023.3277588
Journal volume & issue: Vol. 16
pp. 5013 – 5025

Abstract

Masked image modeling (MIM) has proven to be an optimal pretext task for self-supervised pretraining (SSP): it helps a model capture an effective task-agnostic representation during pretraining and thereby improves fine-tuning performance on various downstream tasks. However, because MIM masks a high ratio of randomly chosen patches, scene-level MIM-based SSP struggles to capture small-scale objects and local details in complex remote sensing scenes. As a result, when pretrained models that mainly capture scene-level information are applied directly to object-level fine-tuning, there is an obvious representation learning misalignment between the pretraining and fine-tuning steps. Therefore, in this article, a novel object-centric masked image modeling (OCMIM) strategy is proposed so that the model better captures object-level information during pretraining, which in turn benefits object detection fine-tuning. First, to better learn object-level representations covering full scales and multiple categories in MIM-based SSP, a novel object-centric data generator is proposed to automatically set up targeted pretraining data according to the objects themselves, providing data conditions specific to object detection model pretraining. Second, an attention-guided mask generator is designed to generate a guided mask for the MIM pretext task, which leads the model to learn more discriminative representations of highly attended object regions than a random masking strategy does. Finally, experiments are conducted on six remote sensing object detection benchmarks, and the results show that the proposed OCMIM-based SSP strategy is a better pretraining approach for remote sensing object detection than commonly used methods.
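
The abstract does not say how the attention-guided mask is computed, so the following is only an illustrative sketch and not the authors' method. It contrasts uniform random patch masking with one plausible guided variant, assuming a per-patch attention score is already available. The function names, the softmax weighting, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Baseline: mask a uniformly random subset of patches."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def attention_guided_mask(attention, mask_ratio, rng, temperature=1.0):
    """Hypothetical guided variant: patches with higher attention scores
    are more likely to be masked (softmax-weighted sampling without
    replacement). This is an assumption, not the paper's generator."""
    attention = np.asarray(attention, dtype=float)
    num_patches = attention.size
    num_masked = int(round(num_patches * mask_ratio))

    # Numerically stable softmax over the per-patch attention scores.
    logits = attention / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    masked_idx = rng.choice(num_patches, size=num_masked, replace=False, p=probs)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 4x4 patch grid: four "object" patches receive high attention.
    attention = np.zeros(16)
    attention[[5, 6, 9, 10]] = 10.0

    print("random :", random_mask(16, 0.25, rng).astype(int))
    print("guided :", attention_guided_mask(attention, 0.25, rng, temperature=0.1).astype(int))
```

In this toy setting the guided mask concentrates on the highly attended patches, whereas the random mask is spread over the whole grid regardless of where objects lie, which is the contrast the abstract draws between the two strategies.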

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
