IEEE Access (Jan 2023)
Vision Transformer Model for Predicting the Severity of Diabetic Retinopathy in Fundus Photography-Based Retina Images
Abstract
Diabetic retinopathy (DR) is a result of prolonged diabetes with poor blood sugar management. It causes vision problems and blindness due to deformation of the human retina, and it has become a crucial medical problem affecting people's health and lives. DR can be diagnosed manually by ophthalmologists, but this is cumbersome and time-consuming, especially given physicians' already heavy workloads. The early detection and prevention of DR, a severe complication of diabetes that can lead to blindness, therefore require an automatic, accurate, and personalized machine learning-based method. Various deep learning algorithms, particularly convolutional neural networks (CNNs), have been investigated for detecting the different stages of DR. Recently, transformers have proved their capabilities in natural language processing. Vision transformers (ViTs) extend these models to capture long-range dependencies in images and have achieved better results than CNN models. However, ViTs typically need huge datasets to learn properly, which has limited their applicability in the DR domain. Recently, a large real-world fundus image dataset called fine-grained annotated diabetic retinopathy (FGADR) has been released, which enables the application of ViTs to the DR diagnosis domain. The literature has not yet explored FGADR for optimizing ViT models. In this paper, we propose a novel ViT-based deep learning pipeline for detecting the severity stages of DR from fundus photography-based retina images. The model is built on the FGADR dataset and optimized with the AdamW optimizer to capture the global context of images. Because FGADR is an imbalanced dataset, we combine several techniques to handle this issue, including the use of the F1-score as the optimization metric, data augmentation, class weights, label smoothing, and focal loss. Extensive experiments have been conducted to explore the role of ViTs with different data balancing techniques in detecting DR. In addition, the proposed model has been compared with state-of-the-art CNN architectures such as ResNet50, InceptionV3, and VGG19. The adopted model was able to capture the crucial features of retinal images and thus characterize DR severity better. It achieved superior results compared to the CNN and baseline ViT models (0.825, 0.825, 0.826, 0.964, 0.825, 0.825, and 0.956 for F1-score, accuracy, balanced accuracy, AUC, precision, recall, and specificity, respectively). These results are encouraging for applying the proposed ViT model in real medical environments to assist physicians in making accurate, personalized, and timely decisions.
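As a minimal illustration of the training recipe summarized above (not the authors' released code), the following PyTorch sketch fine-tunes a ViT with AdamW and a class-weighted, label-smoothed focal loss. The class counts, hyperparameters, and the choice of torchvision's vit_b_16 backbone are assumptions made for the example only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vit_b_16, ViT_B_16_Weights

    class FocalLoss(nn.Module):
        """Focal loss on top of class-weighted, label-smoothed cross-entropy."""

        def __init__(self, gamma=2.0, weight=None, label_smoothing=0.1):
            super().__init__()
            self.gamma = gamma
            self.weight = weight
            self.label_smoothing = label_smoothing

        def forward(self, logits, targets):
            ce = F.cross_entropy(logits, targets, weight=self.weight,
                                 label_smoothing=self.label_smoothing,
                                 reduction="none")
            pt = torch.exp(-ce)  # approx. probability of the true class
            return ((1.0 - pt) ** self.gamma * ce).mean()

    NUM_CLASSES = 5  # DR severity grades 0-4
    class_counts = torch.tensor([100., 200., 400., 150., 50.])  # hypothetical per-grade counts
    class_weights = class_counts.sum() / (NUM_CLASSES * class_counts)  # inverse-frequency weighting

    # Pretrained ViT-B/16 with its classification head replaced for the 5 DR grades.
    model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

    criterion = FocalLoss(gamma=2.0, weight=class_weights, label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    # One illustrative training step on a dummy batch of 224x224 fundus images.
    images = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, NUM_CLASSES, (4,))
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"loss: {loss.item():.4f}")

In a full pipeline, data augmentation would be applied in the dataset transforms, and model selection would track the F1-score on a validation split, per the abstract.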
Keywords