Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain

Pablo Ros-Arlanzón; Angel Perez-Sempere

doi:10.2196/56762

JMIR Medical Education (Nov 2024)

Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain

Pablo Ros-Arlanzón,
Angel Perez-Sempere

Affiliations

Pablo Ros-Arlanzón: ORCiD
Angel Perez-Sempere: ORCiD

DOI: https://doi.org/10.2196/56762
Journal volume & issue: Vol. 10
pp. e56762 – e56762

Abstract

Read online

Abstract BackgroundWith the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. ObjectiveThis study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI’s capabilities and limitations in medical knowledge. MethodsWe conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom’s Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. ResultsHuman participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5’s coefficient of 0.69. ConclusionsThis study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4’s performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment.

Published in JMIR Medical Education

ISSN: 2369-3762 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Education: Special aspects of education; Medicine: Medicine (General)
Website: https://mededu.jmir.org

About the journal