Journal of Medical Internet Research (Jun 2024)

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

  • Adi Lahat,
  • Kassem Sharif,
  • Narmin Zoabi,
  • Yonatan Shneor Patt,
  • Yousra Sharif,
  • Lior Fisher,
  • Uria Shani,
  • Mohamad Arow,
  • Roni Levin,
  • Eyal Klang

DOI: https://doi.org/10.2196/54571
Journal volume & issue: Vol. 26, p. e54571

Abstract


Background: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement.

Objective: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, to illustrate their potential role in health care decision-making, and to compare ratings by senior physicians and residents as well as across question types.

Methods: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across question classifications.

Results: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, and seniors consistently rated responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and rated GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Differences among question types were significant, particularly for GPT-4's mean completeness scores across emergency, internal medicine, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions.

Conclusions: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatment, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.