Digital Health (Oct 2024)

Assessing AI efficacy in medical knowledge tests: A study using Taiwan's internal medicine exam questions from 2020 to 2023

  • Shih-Yi Lin,
  • Ying-Yu Hsu,
  • Shu-Woei Ju,
  • Pei-Chun Yeh,
  • Wu-Huei Hsu,
  • Chia-Hung Kao

DOI: https://doi.org/10.1177/20552076241291404
Journal volume & issue: Vol. 10

Abstract

Background: This study evaluated the ability of generative artificial intelligence (AI) models to handle specialized medical knowledge and problem-solving in a formal examination context.

Methods: The study used internal medicine exam questions provided by the Taiwan Internal Medicine Society from 2020 to 2023 to test three AI models: GPT-4o, Claude 3.5 Sonnet, and Gemini Advanced. Queries rejected by Gemini Advanced were translated into French and resubmitted. Performance was assessed with IBM SPSS Statistics 26: accuracy percentages were calculated, and statistical analyses including Pearson correlation and analysis of variance (ANOVA) were performed to gauge AI efficacy.

Results: GPT-4o's top annual score was 86.25% in 2022, with an average of 81.97%. Claude 3.5 Sonnet peaked at 88.13% in both 2021 and 2022, averaging 84.85%, while Gemini Advanced lagged with an average score of 69.84%. Among specialties, Claude 3.5 Sonnet scored highest in Psychiatry (100%) and Nephrology (97.26%), and GPT-4o performed similarly well in Hematology & Oncology (97.10%) and Nephrology (94.52%). Gemini Advanced's best scores were in Psychiatry (86.96%) and Hematology & Oncology (82.76%), but it struggled with Neurology, scoring below 60%. All models performed better on text-based questions than on image-based ones, although the differences were not statistically significant. On COVID-19-related questions, Claude 3 Opus scored highest at 89.29%, followed by GPT-4o at 75.00% and Gemini Advanced at 67.86%.

Conclusions: The AI models showed varied proficiency across medical specialties and question types. GPT-4o achieved a higher rate of correct answers on image-based questions, while Claude 3.5 Sonnet consistently outperformed the other models overall, highlighting significant potential for AI to assist medical education.