IEEE Access (Jan 2024)
A Novel Transformer Model With Multiple Instance Learning for Diabetic Retinopathy Classification
Abstract
Diabetic retinopathy (DR) is an irreversible fundus retinopathy. A deep learning-based automated DR diagnosis system can save diagnostic time. While the Transformer has shown superior performance compared to the Convolutional Neural Network (CNN), it typically requires pre-training with large amounts of data. Although Transformer-based DR diagnosis methods may alleviate the problem of limited performance on small-scale retinal datasets by loading pre-trained weights, the size of input images is restricted to $224\times 224$. The resolution of retinal images captured by fundus cameras is much higher than $224\times 224$, and reducing the resolution for training results in the loss of valuable information. To efficiently utilize high-resolution retinal images, a new Transformer model with multiple instance learning (TMIL) is proposed for DR classification. A multiple instance learning approach is first applied to the retinal images to segment these high-resolution images into $224\times 224$ image patches. Subsequently, a Vision Transformer (ViT) is used to extract features from each patch. Then, a Global Instance Computing Block (GICB) is designed to calculate the inter-instance features. After introducing global information from the GICB, the features are used to output the classification results. When using high-resolution retinal images, TMIL can load pre-trained Transformer weights without model performance being affected by weight interpolation. Experimental results on the APTOS dataset and the Messidor-1 dataset demonstrate that TMIL achieves better classification performance and reduces inference time by 62% compared with directly inputting high-resolution images into ViT. TMIL also achieves the highest classification accuracy compared with current state-of-the-art results. The code will be publicly available at https://github.com/CNMaxYang/TMIL.
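The first stage of the pipeline described above, splitting a high-resolution retinal image into a bag of $224\times 224$ instances for the ViT, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `to_instances` is hypothetical, and it assumes the image has been resized so that its height and width are multiples of 224 (e.g. $896\times 896$); padding or cropping would be needed otherwise.

```python
import numpy as np

def to_instances(image, patch=224):
    """Split an (H, W, C) image into a bag of (patch, patch, C)
    instances, as in multiple instance learning.
    Assumes H and W are multiples of `patch`."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "resize/pad the image first"
    # Reshape into a grid of non-overlapping tiles, then flatten the grid
    # so each tile becomes one instance in the bag.
    bag = (image
           .reshape(h // patch, patch, w // patch, patch, c)
           .transpose(0, 2, 1, 3, 4)
           .reshape(-1, patch, patch, c))
    return bag  # shape: (num_instances, patch, patch, c)

# A 896x896 retinal image yields a bag of 16 instances of 224x224,
# each of which a pre-trained ViT can consume without weight interpolation.
img = np.zeros((896, 896, 3), dtype=np.uint8)
bag = to_instances(img)
print(bag.shape)  # (16, 224, 224, 3)
```

Per the abstract, each instance is then encoded by the ViT, and the GICB aggregates inter-instance features before classification.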
Keywords