Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology

Jessica Huwiler; Luca Oechslin; Patric Biaggi; Felix C. Tanner; Christophe Alain Wyss

doi:10.57187/s.3547

Swiss Medical Weekly (Oct 2024)

Affiliations

Jessica Huwiler
Luca Oechslin
Patric Biaggi
Felix C. Tanner
Christophe Alain Wyss: Prof. Dr. med.

DOI: https://doi.org/10.57187/s.3547
Journal volume & issue: Vol. 154, no. 10

Abstract

AIMS: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows.

METHODS: A set of 88 multiple-choice cardiology exam questions was presented to the participating cardiology fellows and to the selected chatbots. The evaluation metrics were Top-1 and Top-2 accuracy, which assess the ability of chatbots and fellows to select the correct answer.

RESULTS: All 36 participating cardiology fellows passed the exam, with a median accuracy of 98% (IQR 91–99%, range 78–100%). The performance of the chatbots varied. Only one chatbot, Jasper quality, achieved the minimum pass rate of 73% correct answers. Across chatbots, the median Top-1 accuracy was 47% (IQR 44–53%, range 42–73%), while Top-2 accuracy provided a modest improvement, with a median of 67% (IQR 65–72%, range 61–82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset.

CONCLUSIONS: Overall, the study suggests that most current language-based chatbots have limitations in accurately solving theoretical medical board exams. In general, the widely available chatbots fell short of achieving a passing score in a theoretical cardiology board exam. Nevertheless, a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.
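As a rough illustration of the Top-1 and Top-2 metrics named in the methods, the following minimal Python sketch scores ranked answers against an answer key. It rests on an assumption not spelled out in the abstract: that Top-2 counts a question as correct when the right option is among the respondent's first two choices. The function names and the toy data are hypothetical and are not taken from the study; only the 73% pass mark comes from the abstract.

```python
from typing import Sequence

# Minimum pass rate stated in the abstract (73% correct answers).
PASS_THRESHOLD = 0.73


def top_k_accuracy(ranked_answers: Sequence[Sequence[str]],
                   answer_key: Sequence[str],
                   k: int) -> float:
    """Fraction of questions whose correct option is among the first k choices.

    Assumption: each entry of ranked_answers lists a respondent's choices
    for one question, ordered from most to least preferred.
    """
    if len(ranked_answers) != len(answer_key):
        raise ValueError("one ranked answer list is needed per question")
    if not answer_key:
        raise ValueError("the answer key must not be empty")
    hits = sum(
        1
        for ranked, correct in zip(ranked_answers, answer_key)
        if correct in ranked[:k]
    )
    return hits / len(answer_key)


def passes(accuracy: float) -> bool:
    """Check an accuracy against the pass mark."""
    return accuracy >= PASS_THRESHOLD


# Hypothetical toy data (four questions), NOT results from the study.
answer_key = ["A", "C", "B", "D"]
ranked_answers = [["A", "B"], ["B", "C"], ["B", "A"], ["A", "C"]]

top1 = top_k_accuracy(ranked_answers, answer_key, k=1)
top2 = top_k_accuracy(ranked_answers, answer_key, k=2)
print(f"Top-1: {top1:.2f} (pass: {passes(top1)})")
print(f"Top-2: {top2:.2f} (pass: {passes(top2)})")
```

In this toy case the second choice rescues one question, so Top-2 accuracy is higher than Top-1; the same mechanism would explain why a second chatbot could clear the pass mark only under Top-2 scoring.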

Published in Swiss Medical Weekly

ISSN: 1424-7860 (Print); 1424-3997 (Online)
Publisher: SMW supporting association (Trägerverein Swiss Medical Weekly SMW)
Country of publisher: Switzerland
LCC subjects: Medicine
Website: http://www.smw.ch