IEEE Access (Jan 2024)

A Deep Learning Approach for Crop Disease and Pest Classification Using Swin Transformer and Dual-Attention Multi-Scale Fusion Network

  • R. Karthik,
  • Armaano Ajay,
  • Akshaj Singh Bisht,
  • T. Illakiya,
  • K. Suganthi

DOI
https://doi.org/10.1109/ACCESS.2024.3481675
Journal volume & issue
Vol. 12
pp. 152639–152655

Abstract

Crops are essential for the survival of the human race, and their significance rises steadily alongside the growing global population. Cashew, cassava, maize, and tomato are major economic crops that support the livelihoods, economies, and cultures of many countries. However, these crops face several threats from diseases and pests, which can lead to significant yield and economic losses. Early identification of crop diseases is therefore essential for protecting crops and ensuring the global food supply. Current diagnostic methods are largely manual, which is time-consuming and requires domain expertise. Visual inspection by farmers for disease and pest symptoms often delays identification, allowing diseases to progress to more severe stages. Deep learning offers a fast and accurate alternative by automating disease and pest detection. This research proposes a new feature fusion-based model that uses two parallel tracks to extract features from plant images. The Swin transformer track captures global features through shifted windows and self-attention. The Dual-Attention Multi-scale Fusion Network (DAMFN) track extracts local features using two specialised blocks: the Multi-Separable Attention (MSA) and Tri Shuffle Convolution Attention (TSCA) blocks. These blocks employ varied kernel sizes and attention modules to capture refined features at multiple scales. To the best of our knowledge, this study is the first to report results from a dual-track architecture combining the Swin transformer and DAMFN networks for crop disease and pest detection. The model fuses the features extracted by both tracks and applies triplet attention to focus on the most informative regions for precise classification. The proposed network surpassed several state-of-the-art architectures, achieving an accuracy of 95.68% on the CCMT dataset.
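
To make the fusion stage described above concrete, the sketch below shows how two parallel feature tracks can be concatenated channel-wise and refined with triplet attention before classification. It is a minimal PyTorch illustration based only on the abstract's description: swin_track, damfn_track, and the simplified TripletAttention module are hypothetical stand-ins for the paper's components, not the authors' implementation.

import torch
import torch.nn as nn

class TripletAttention(nn.Module):
    # Cross-dimension attention over three rotated views of the feature map,
    # following the general idea of triplet attention; this simplified form
    # is an assumption, not the exact module used in the paper.
    def __init__(self, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv_hw = nn.Conv2d(2, 1, kernel_size, padding=pad)
        self.conv_ch = nn.Conv2d(2, 1, kernel_size, padding=pad)
        self.conv_cw = nn.Conv2d(2, 1, kernel_size, padding=pad)

    @staticmethod
    def _gate(x, conv):
        # Pool along the leading (channel-like) axis with max and mean,
        # then a small conv produces a gate that rescales the branch.
        s = torch.cat([x.max(1, keepdim=True).values,
                       x.mean(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(conv(s))

    def forward(self, x):
        # Branch 1: plain spatial attention over (H, W).
        b1 = self._gate(x, self.conv_hw)
        # Branch 2: rotate so the channel axis interacts with H.
        b2 = self._gate(x.permute(0, 2, 1, 3), self.conv_ch).permute(0, 2, 1, 3)
        # Branch 3: rotate so the channel axis interacts with W.
        b3 = self._gate(x.permute(0, 3, 2, 1), self.conv_cw).permute(0, 3, 2, 1)
        return (b1 + b2 + b3) / 3.0

class DualTrackClassifier(nn.Module):
    # Two parallel feature extractors whose output maps are fused by
    # channel concatenation, refined by triplet attention, and classified.
    # swin_track and damfn_track are placeholders for any backbones that
    # return feature maps of matching spatial size (B, C, H, W).
    def __init__(self, swin_track, damfn_track, fused_channels, num_classes):
        super().__init__()
        self.swin_track = swin_track    # global features (shifted windows)
        self.damfn_track = damfn_track  # local multi-scale features
        self.attention = TripletAttention()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(fused_channels, num_classes),
        )

    def forward(self, x):
        g = self.swin_track(x)            # (B, C1, H, W) global feature map
        l = self.damfn_track(x)           # (B, C2, H, W) local feature map
        fused = torch.cat([g, l], dim=1)  # channel-wise fusion, C1 + C2
        return self.head(self.attention(fused))

The key design point the abstract emphasises is that fusion happens before attention: concatenating the global and local maps first lets the triplet attention weigh both sources jointly when selecting the most informative regions.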

Keywords