AJO International (Jul 2025)

Evaluation of ChatGPT-4 in detecting referable diabetic retinopathy using single fundus images

  • Owais Aftab,
  • Hamza Khan,
  • Brian L. VanderBeek,
  • Drew Scoles,
  • Benjamin J. Kim,
  • Jonathan C. Tsui

DOI
https://doi.org/10.1016/j.ajoint.2025.100111
Journal volume & issue
Vol. 2, no. 2
p. 100111

Abstract

Purpose: To evaluate ChatGPT-4's ability to identify referable diabetic retinopathy (DR) from single fundus images.

Design: A cross-sectional study comparing ChatGPT-4's and retina specialists' identification of more than mild DR (mtmDR) and vision-threatening DR (VTDR).

Methods: Images in equal proportions of normal, mild, moderate, and severe nonproliferative DR (NPDR), proliferative DR (PDR), and blurry images with and without suspected PDR were presented to a panel of blinded retina specialists, who classified each image as readable or unreadable and, where applicable, as mtmDR or VTDR. The same images were submitted to ChatGPT-4 three times with a standardized prompt regarding mtmDR and VTDR. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for ChatGPT-4's responses regarding mtmDR and VTDR, using the retina specialists' majority determination as the reference.

Results: Retina specialists read 158/180 prompts (87.7%) with excellent interrater reliability, while ChatGPT-4 read 132/180 (73.33%) of the image prompts. For mtmDR, ChatGPT-4 demonstrated a sensitivity of 96.2%, specificity of 19.1%, PPV of 69.1%, and NPV of 72.7%. Overall, 90.9% of prompts read by ChatGPT-4 were labeled as mtmDR. For VTDR, ChatGPT-4 demonstrated a sensitivity of 63.0%, specificity of 62.5%, PPV of 71.9%, and NPV of 52.6% compared with retina specialists. ChatGPT-4 labeled 51.5% of read images as VTDR. Overall referability was 66.6% for retina specialists and 93.3% for ChatGPT-4.

Conclusion: While ChatGPT-4 shows promise in identifying moderate-to-severe DR, its limited specificity and tendency to overcall disease reduce its current utility as a screening tool.
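For readers less familiar with the four reported metrics, the following is a minimal sketch of how sensitivity, specificity, PPV, and NPV are derived from a 2x2 comparison against a reference standard (here, the specialists' majority determination). The class name `ConfusionCounts`, the function `screening_metrics`, and the example counts are hypothetical and chosen only for illustration; they are not the study's data, and the abstract does not describe the authors' actual analysis code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing a test's calls against a reference standard."""
    tp: int  # test positive, reference positive
    fp: int  # test positive, reference negative
    fn: int  # test negative, reference positive
    tn: int  # test negative, reference negative


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the four standard diagnostic accuracy metrics as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true disease that is flagged
        "specificity": c.tn / (c.tn + c.fp),  # share of non-disease that is cleared
        "ppv": c.tp / (c.tp + c.fp),          # share of positive calls that are correct
        "npv": c.tn / (c.tn + c.fn),          # share of negative calls that are correct
    }


# Hypothetical counts for illustration only (NOT taken from the study).
example = ConfusionCounts(tp=45, fp=30, fn=5, tn=20)
metrics = screening_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

In this made-up example the test flags most images as positive, so sensitivity is high while specificity is low. This is the same qualitative pattern the abstract describes as a tendency to overcall disease.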

Keywords