Multi-Dataset Comparison of Vision Transformers and Convolutional Neural Networks for Detecting Glaucomatous Optic Neuropathy from Fundus Photographs

Elizabeth E. Hwang; Dake Chen; Ying Han; Lin Jia; Jing Shan

doi:10.3390/bioengineering10111266

Bioengineering (Oct 2023)

Affiliations

Elizabeth E. Hwang: Department of Ophthalmology, University of California, San Francisco, San Francisco, CA 94143, USA
Dake Chen: Department of Ophthalmology, University of California, San Francisco, San Francisco, CA 94143, USA
Ying Han: Department of Ophthalmology, University of California, San Francisco, San Francisco, CA 94143, USA
Lin Jia: Digillect LLC, San Francisco, CA 94158, USA
Jing Shan: Department of Ophthalmology, University of California, San Francisco, San Francisco, CA 94143, USA

Journal volume & issue: Vol. 10, no. 11, p. 1266

Abstract

Glaucomatous optic neuropathy (GON) can be diagnosed and monitored using fundus photography, a widely available and low-cost approach already adopted for automated screening of ophthalmic diseases such as diabetic retinopathy. Despite this, the lack of validated early screening approaches remains a major obstacle to preventing glaucoma-related blindness. Deep learning models have attracted significant interest as potential solutions because they offer objective, high-throughput methods for processing image-based medical data. While convolutional neural networks (CNNs) have been widely used for these purposes, more recent advances in applying Transformer architectures have produced new models, including the Vision Transformer (ViT), that have shown promise in many domains of image analysis. However, previous studies have not sufficiently compared the two architectures side by side across more than a single dataset, making it unclear which is more generalizable or performs better in different clinical contexts. Our purpose is to investigate comparable ViT and CNN models tasked with GON detection from fundus photographs and to highlight their respective strengths and weaknesses. We train CNN and ViT models on six unrelated, publicly available databases and compare their performance using well-established metrics, including the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Our results indicate that ViT models often outperform a similarly trained CNN model, particularly when non-glaucomatous images are over-represented in a given dataset. We discuss the clinical implications of these findings and suggest that ViT can further the development of accurate and scalable GON detection, addressing a leading cause of irreversible blindness worldwide.
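The abstract compares models by AUC, sensitivity, and specificity. As a minimal sketch, assuming binary labels with 1 = GON and 0 = non-glaucomatous and a hypothetical 0.5 decision threshold, the three metrics can be computed from per-image model scores as below. AUC is computed here with the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve; this is not necessarily how the authors computed it. The toy labels and scores are invented for illustration and are not data from the study.

```python
from typing import Sequence


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding: 1 = glaucomatous (positive), 0 = non-glaucomatous.
    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical, imbalanced toy data: 2 GON images, 6 non-glaucomatous images.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1, 0.6, 0.3, 0.05]
threshold = 0.5  # hypothetical operating point
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc:.3f}")
# sensitivity=0.500 specificity=0.833 AUC=0.917
```

Sensitivity and specificity depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds. Reporting them separately matters when non-glaucomatous images dominate a dataset, because a classifier that labels every image as non-glaucomatous would still score high accuracy while having zero sensitivity.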

Published in Bioengineering

ISSN: 2306-5354 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology; Science: Biology (General)
Website: https://www.mdpi.com/journal/bioengineering