Evaluating prompt engineering on GPT-3.5’s performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4

Dhavalkumar Patel; Ganesh Raut; Eyal Zimlichman; Satya Narayan Cheetirala; Girish N Nadkarni; Benjamin S. Glicksberg; Donald U. Apakama; Elijah J. Bell; Robert Freeman; Prem Timsina; Eyal Klang

doi:10.1038/s41598-024-66933-x

Scientific Reports (Jul 2024)

Affiliations

Dhavalkumar Patel: Mount Sinai Health System
Ganesh Raut: Mount Sinai Health System
Eyal Zimlichman: Hospital Management, Sheba Medical Center, Affiliated to Tel-Aviv University
Satya Narayan Cheetirala: Mount Sinai Health System
Girish N Nadkarni: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Benjamin S. Glicksberg: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Donald U. Apakama: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai
Elijah J. Bell: University of California
Robert Freeman: Mount Sinai Health System
Prem Timsina: Mount Sinai Health System
Eyal Klang: ARC Innovation Center, Sheba Medical Center, Affiliated to Tel-Aviv University

Journal volume & issue: Vol. 14, no. 1, pp. 1–10

Abstract

This study assessed how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those in the style of the USMLE Step 1 exam. We analyzed GPT-3.5's responses to two distinct sets of questions: 1000 questions generated by GPT-4 and 95 real USMLE Step 1 questions. The questions spanned medical calculations and clinical scenarios across various fields and difficulty levels. We found no significant differences in the accuracy of GPT-3.5's responses across direct, CoT, and modified CoT prompts. For instance, in the USMLE sample, success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT (p = 0.734). Similar trends were observed for the GPT-4-generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between prompt types. We conclude that CoT prompt engineering does not significantly alter GPT-3.5's performance on medical calculation or clinical scenario questions in the USMLE style. This finding suggests that ChatGPT's performance remains consistent whether CoT or direct prompting is used. Such consistency could simplify the integration of AI tools like ChatGPT into medical education, allowing healthcare professionals to use them without complex prompt engineering.
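The comparison of success rates across the three prompt types can be illustrated with a short sketch. The abstract does not name the statistical test used, so the Pearson chi-square test of independence below is an assumption, as is the function name `chi_square_three_groups`. The counts are also hypothetical: they are back-calculated from the reported percentages on the assumption that 94 questions were scored per prompt type (58, 59, and 54 correct), which is not stated in the abstract. This should not be read as a reproduction of the authors' analysis.

```python
import math


def chi_square_three_groups(correct, total):
    """Pearson chi-square test of independence for 3 groups x 2 outcomes.

    correct: number of correct answers per prompt type
    total:   number of questions scored per prompt type
    Returns (statistic, p_value). With 3 groups and 2 outcomes the test has
    2 degrees of freedom, for which the chi-square survival function has the
    closed form exp(-x / 2), so no external library is needed.
    """
    if len(correct) != 3 or len(total) != 3:
        raise ValueError("this sketch handles exactly three groups")

    incorrect = [t - c for c, t in zip(correct, total)]
    grand_total = sum(total)
    pooled_correct_rate = sum(correct) / grand_total
    pooled_incorrect_rate = sum(incorrect) / grand_total

    statistic = 0.0
    for c, i, t in zip(correct, incorrect, total):
        expected_correct = t * pooled_correct_rate
        expected_incorrect = t * pooled_incorrect_rate
        statistic += (c - expected_correct) ** 2 / expected_correct
        statistic += (i - expected_incorrect) ** 2 / expected_incorrect

    p_value = math.exp(-statistic / 2.0)  # valid only for 2 degrees of freedom
    return statistic, p_value


# Hypothetical counts (assumption): direct, CoT, modified CoT, 94 scored each.
assumed_correct = [58, 59, 54]
assumed_total = [94, 94, 94]

stat, p = chi_square_three_groups(assumed_correct, assumed_total)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```

Under these assumed counts the p-value is far above 0.05, which is the pattern the abstract describes: no significant difference in accuracy between prompt types.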

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
