IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
PolSAR Image Classification Via a Multigranularity Hybrid CNN-ViT Model With External Tokens and Cross-Attention
Abstract
With the development of deep learning, convolutional neural networks (CNNs) and vision transformers (ViTs) have been increasingly applied to polarimetric synthetic aperture radar (PolSAR) image classification. However, because of its special data form, a PolSAR image carries very rich information, which makes it difficult for an existing single network structure to extract that information comprehensively. In addition, deep learning methods require a large amount of training data, whereas labeled PolSAR data is scarce and difficult to obtain. Therefore, a multigranularity hybrid CNN-ViT model based on external tokens and cross-attention is proposed for PolSAR image classification. First, because CNNs learn local features well, a CNN-based external feature extractor is designed to extract local features from the PolSAR image. Second, because ViTs can focus on global features, a multigranularity attention structure is constructed to extract global information at multiple scales. With these two modules, the model can access the feature information contained in PolSAR images more fully than a single network structure can. Next, to further exploit these features, a cross-attention feature fusion module is built to fuse global–local information of different granularities. Finally, a softmax classifier outputs the final prediction results. Experimental results on three benchmark datasets show that the proposed method achieves the best classification performance among the compared methods, even when trained with only a small amount of labeled data.
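The cross-attention fusion mentioned in the abstract can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation: the token counts, the embedding size, the random projection matrices, and the choice of ViT tokens as queries with CNN-derived external tokens as keys and values are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, dim) tokens that ask for information.
    context: (n_c, dim) tokens that supply keys and values.
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) similarity scores
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ v, weights

# Hypothetical sizes, chosen only for this example.
rng = np.random.default_rng(0)
dim = 16
global_tokens = rng.normal(size=(9, dim))  # assumed: ViT tokens at one granularity
local_tokens = rng.normal(size=(4, dim))   # assumed: external tokens from the CNN
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = cross_attention(global_tokens, local_tokens, w_q, w_k, w_v)
```

Each row of `fused` is a weighted mix of the projected local tokens, so every global token is enriched with local information. A multigranularity variant would presumably repeat this step for token sets produced at different patch sizes.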
Keywords