Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments

Dana Brin; Vera Sorin; Akhil Vaid; Ali Soroush; Benjamin S. Glicksberg; Alexander W. Charney; Girish Nadkarni; Eyal Klang

doi:10.1038/s41598-023-43436-9

Scientific Reports (Oct 2023)

Affiliations

Dana Brin: Department of Diagnostic Imaging, Chaim Sheba Medical Center
Vera Sorin: Department of Diagnostic Imaging, Chaim Sheba Medical Center
Akhil Vaid: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Ali Soroush: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai
Benjamin S. Glicksberg: Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai
Alexander W. Charney: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Girish Nadkarni: Division of Data-Driven and Digital Medicine (D3M), The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Eyal Klang: Department of Diagnostic Imaging, Chaim Sheba Medical Center

Journal volume & issue: Vol. 13, no. 1, pp. 1–5

Abstract

The United States Medical Licensing Examination (USMLE) has been a subject of performance study for artificial intelligence (AI) models. However, their performance on questions involving USMLE soft skills remains unexplored. This study aimed to evaluate ChatGPT and GPT-4 on USMLE questions involving communication skills, ethics, empathy, and professionalism. We used 80 USMLE-style questions involving soft skills, taken from the USMLE website and the AMBOSS question bank. A follow-up query was used to assess the models’ consistency. The performance of the AI models was compared to that of previous AMBOSS users. GPT-4 outperformed ChatGPT, correctly answering 90% compared to ChatGPT’s 62.5%. GPT-4 showed more confidence, not revising any responses, while ChatGPT modified its original answers 82.5% of the time. The performance of GPT-4 was higher than that of AMBOSS’s past users. Both AI models, notably GPT-4, showed capacity for empathy, indicating AI’s potential to meet the complex interpersonal, ethical, and professional demands intrinsic to the practice of medicine.

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
