Vision Transformer With Contrastive Learning for Remote Sensing Image Scene Classification

Meiqiao Bi; Minghua Wang; Zhi Li; Danfeng Hong

doi:10.1109/JSTARS.2022.3230835

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2023)

Affiliations

Meiqiao Bi: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Minghua Wang: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Zhi Li: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Danfeng Hong: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China

DOI: https://doi.org/10.1109/JSTARS.2022.3230835
Journal volume & issue: Vol. 16, pp. 738–749

Abstract

Remote sensing images (RSIs) are characterized by complex spatial layouts and ground object structures. The vision transformer (ViT) is a good candidate for scene classification because it can capture long-range interactions between patches of an input image. However, because it lacks some inductive biases inherent to convolutional neural networks (CNNs), such as locality and translation equivariance, ViT does not generalize well when trained on insufficient data. Compared with training ViT from scratch, transferring a large-scale pretrained model is more cost-efficient and performs better, even when the target dataset is small. In addition, the cross-entropy (CE) loss, although frequently used in scene classification, has low robustness to noisy labels and generalizes poorly across different scenes. In this article, a ViT-based model combined with supervised contrastive learning (CL), named ViT-CL, is proposed. For CL, the supervised contrastive (SupCon) loss, which extends the self-supervised contrastive approach to the fully supervised setting, can exploit the label information of RSIs in the embedding space and improve robustness to common image corruptions. In ViT-CL, a joint loss function that combines the CE loss and the SupCon loss is developed to encourage the model to learn more discriminative features. In addition, a two-stage optimization framework is introduced to make the optimization of the ViT-CL model more controllable. Extensive experiments on the AID, NWPU-RESISC45, and UCM datasets verified the superior performance of ViT-CL, which achieved the highest accuracies among all competing methods: 97.42%, 94.54%, and 99.76%, respectively.
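
The idea of a joint loss that combines the CE loss with the SupCon loss can be illustrated with a minimal pure-Python sketch. The abstract does not give the weighting between the two terms, the temperature, or how the two-stage optimization schedules them, so the `weight` and `temperature` parameters and all function names below are assumptions made for illustration, not the authors' implementation. The SupCon term follows the commonly used formulation in which, for each anchor, the log-probabilities of its same-label positives are averaged.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of per-class logit lists."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch of embeddings.

    Anchors without any same-label positive in the batch are skipped.
    The temperature value is an illustrative assumption.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled similarity of the anchor to every other sample in the batch.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        peak = max(sims.values())
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in sims.values())
        )
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def joint_loss(logits, embeddings, labels, weight=0.5, temperature=0.1):
    """Convex combination of CE and SupCon; `weight` is a hypothetical knob."""
    ce = cross_entropy(logits, labels)
    sc = supcon_loss(embeddings, labels, temperature)
    return (1.0 - weight) * ce + weight * sc


if __name__ == "__main__":
    batch_logits = [[2.0, 0.1], [1.5, 0.3], [0.2, 1.8], [0.1, 2.2]]
    batch_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
    batch_labels = [0, 0, 1, 1]
    print(joint_loss(batch_logits, batch_embeddings, batch_labels))
```

In this sketch the SupCon term becomes small when embeddings of the same class cluster together and large when same-label samples are far apart in the embedding space, which is the sense in which it uses label information that the CE term alone does not. In practice the same computation would be done with a tensor library over a ViT backbone's features.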

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
