Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

Christopher Y. K. Williams; Brenda Y. Miao; Aaron E. Kornblith; Atul J. Butte

doi:10.1038/s41467-024-52415-1

Nature Communications (Oct 2024)

Affiliations

Christopher Y. K. Williams: Bakar Computational Health Sciences Institute, University of California, San Francisco
Brenda Y. Miao: Bakar Computational Health Sciences Institute, University of California, San Francisco
Aaron E. Kornblith: Bakar Computational Health Sciences Institute, University of California, San Francisco
Atul J. Butte: Bakar Computational Health Sciences Institute, University of California, San Francisco

DOI: https://doi.org/10.1038/s41467-024-52415-1
Journal volume & issue: Vol. 15, no. 1, pp. 1–10

Abstract

The release of GPT-4 and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly selected 10,000 Emergency Department visits to evaluate the accuracy of zero-shot GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across four different prompting strategies. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores that were on average 8% and 24% lower, respectively, than the physician's. Both LLMs tended to be overly cautious in their recommendations, with high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before their deployment as decision support systems for clinical recommendations and other complex tasks.
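
To make the reported trade-off concrete, the sketch below shows how accuracy, sensitivity, and specificity might be computed for one binary recommendation task such as admission status. It is an illustration only, not the authors' evaluation code: the function name `binary_metrics` and the toy labels are assumptions, and the numbers are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Hypothetical toy data (NOT from the paper): 1 = "admit", 0 = "do not admit".
# The predictions recommend admission far more often than the ground truth does.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0]

m = binary_metrics(y_true, y_pred)
print(m)
```

On this toy input every true admission is caught (sensitivity 1.0), but only three of the seven non-admissions are correctly identified, which is the "overly cautious" pattern the abstract describes: high sensitivity at the cost of specificity.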

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
