IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
A Novel Network-Level Fusion Architecture of Proposed Self-Attention and Vision Transformer Models for Land Use and Land Cover Classification From Remote Sensing Images
Abstract
Recent deep learning techniques, driven by vast amounts of data, have demonstrated the remarkable feature-learning power of convolutional neural networks (CNNs) for land use and land cover classification in remote sensing (RS). In this work, we propose a new network-level fusion deep architecture based on a 16-tiny Vision Transformer and SIBNet. In the initial phase, data augmentation is performed to resolve the problem of data imbalance. Next, we propose a self-attention bottleneck-based inception CNN named SIBNet, which combines two architectural ideas: its blocks follow the inception design, and each inception module is built from bottleneck blocks. The 16-tiny Vision Transformer architecture is adapted to RS images and, for the first time, fused with SIBNet through network-level fusion. The hyperparameters of the proposed model are initialized using Bayesian optimization for better training on RS images. After fusion, the model is trained on RS image datasets, and deep features are extracted from the self-attention layer. The extracted features are classified using a neural network classifier with multiple hidden layers. Experiments are conducted on two publicly available datasets, EuroSAT and NWPU, yielding accuracies of 97.8% and 98.9%, respectively. A detailed ablation study shows that the fusion model achieves improved accuracy, and a comparison of the proposed method with recent techniques shows improved precision, recall, and accuracy.
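To make the network-level fusion idea concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' released code): a CNN branch built from inception-style modules with 1x1 bottlenecks and a self-attention layer, fused by feature concatenation with a small transformer branch of roughly ViT-tiny/16 dimensions. All layer sizes, the class count, the 224x224 input, and the class names FusionNet/BottleneckInception are assumptions for demonstration only.

```python
# Illustrative sketch of network-level fusion: a bottleneck-inception CNN branch
# with self-attention ("SIBNet"-style) concatenated with a ViT-tiny-like branch.
# All hyperparameters below are assumed values, not those reported in the paper.
import torch
import torch.nn as nn


class BottleneckInception(nn.Module):
    """Inception-style module whose parallel paths use 1x1 bottleneck convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 4
        self.p1 = nn.Conv2d(in_ch, mid, 1)                                  # 1x1 path
        self.p3 = nn.Sequential(nn.Conv2d(in_ch, mid, 1),
                                nn.Conv2d(mid, mid, 3, padding=1))          # 1x1 -> 3x3
        self.p5 = nn.Sequential(nn.Conv2d(in_ch, mid, 1),
                                nn.Conv2d(mid, mid, 5, padding=2))          # 1x1 -> 5x5
        self.pp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, mid, 1))                   # pool -> 1x1
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(torch.cat([self.p1(x), self.p3(x),
                                   self.p5(x), self.pp(x)], dim=1))


class FusionNet(nn.Module):
    """Fuses CNN-branch and transformer-branch features by concatenation."""
    def __init__(self, num_classes=10, embed=192):
        super().__init__()
        # CNN branch: stem + bottleneck-inception blocks + global average pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            BottleneckInception(64, 128),
            BottleneckInception(128, 256),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Self-attention over the pooled CNN feature (stand-in for the paper's
        # self-attention layer used for deep feature extraction).
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        # Transformer branch: 16x16 patch embedding + encoder with ViT-tiny-like width.
        self.patch = nn.Conv2d(3, embed, kernel_size=16, stride=16)
        enc = nn.TransformerEncoderLayer(embed, nhead=3,
                                         dim_feedforward=embed * 4,
                                         batch_first=True)
        self.vit = nn.TransformerEncoder(enc, num_layers=4)
        # Neural network classifier with a hidden layer over the fused features.
        self.head = nn.Sequential(nn.Linear(256 + embed, 256), nn.ReLU(inplace=True),
                                  nn.Linear(256, num_classes))

    def forward(self, x):
        f_cnn = self.cnn(x)                                  # (B, 256)
        f_cnn, _ = self.attn(f_cnn.unsqueeze(1),
                             f_cnn.unsqueeze(1),
                             f_cnn.unsqueeze(1))
        f_cnn = f_cnn.squeeze(1)
        tokens = self.patch(x).flatten(2).transpose(1, 2)    # (B, N_patches, embed)
        f_vit = self.vit(tokens).mean(dim=1)                 # token-averaged (B, embed)
        return self.head(torch.cat([f_cnn, f_vit], dim=1))   # fused prediction


logits = FusionNet()(torch.randn(2, 3, 224, 224))            # -> shape (2, 10)
```

The key design choice sketched here is that fusion happens at the network level, i.e., both branches see the same input image and their feature vectors are concatenated before the classifier, rather than averaging predictions from two separately trained models.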
Keywords