Scientific Reports (Oct 2023)

Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology

  • Andrea Taloni,
  • Massimiliano Borselli,
  • Valentina Scarsi,
  • Costanza Rossi,
  • Giulia Coco,
  • Vincenzo Scorcia,
  • Giuseppe Giannaccare

DOI
https://doi.org/10.1038/s41598-023-45837-2
Journal volume & issue
Vol. 13, no. 1
pp. 1–7

Abstract

The aim of this study was to compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with a significant difference in accuracy rates (always P < 0.0001). For difficult questions (answered incorrectly by more than 50% of humans), both GPT models compared favorably with humans, without the difference reaching statistical significance. The mean word count of answers provided by GPT-4.0 was significantly lower than that of answers produced by GPT-3.5 (160 ± 56 and 206 ± 77 words, respectively; P < 0.0001); however, incorrect responses were longer (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans on an AAO BCSC self-assessment test. However, ChatGPT remains limited by inconsistency across practice areas, especially surgery.
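
The abstract reports pairwise significance tests on the three overall accuracy rates. As a rough, hypothetical illustration only (the authors' actual analysis pipeline is not described here), the Python sketch below treats each of the 1023 questions as a single correct/incorrect trial per group and runs pairwise chi-squared tests on the percentages quoted above; since the human figure is an aggregate of per-question response rates, this simplification can only approximate the published comparison.

```python
# Illustrative sketch, NOT the authors' analysis: derive 2x2 contingency tables
# from the accuracy rates reported in the abstract and compare them pairwise.
from scipy.stats import chi2_contingency

N = 1023  # total number of BCSC questions analyzed in the study

# Overall accuracy rates quoted in the abstract
accuracy = {
    "GPT-4.0": 0.824,
    "Humans": 0.757,
    "GPT-3.5": 0.659,
}

def correct_counts(rate, n=N):
    """Turn an accuracy rate into [correct, incorrect] counts (rounded)."""
    correct = round(rate * n)
    return [correct, n - correct]

# Pairwise 2x2 chi-squared tests on the derived counts
pairs = [("GPT-4.0", "Humans"), ("GPT-4.0", "GPT-3.5"), ("Humans", "GPT-3.5")]
for a, b in pairs:
    table = [correct_counts(accuracy[a]), correct_counts(accuracy[b])]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.1f}, p = {p:.3g}")
```

A two-proportion z-test (e.g. statsmodels' proportions_ztest) would be an equivalent alternative for these pairwise comparisons under the same simplifying assumption.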