A systematic evaluation of the performance of GPT‐4 and PaLM2 to diagnose comorbidities in MIMIC‐IV patients

Peter Sarvari; Zaid Al‐fagih; Abdullatif Ghuwel; Othman Al‐fagih

doi:10.1002/hcs2.79

Health Care Science (Feb 2024)

Affiliations

Peter Sarvari: Rhazes AI, London, UK
Zaid Al‐fagih: Rhazes AI, London, UK
Abdullatif Ghuwel: National Health Service England, London, UK
Othman Al‐fagih: National Health Service England, London, UK

DOI: https://doi.org/10.1002/hcs2.79
Journal volume & issue: Vol. 3, no. 1, pp. 3–18

Abstract

Background: Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT‐4 and PaLM2. Small‐scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT‐4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.

Methods: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we will refer to as ground truth diagnoses. We then designed carefully written prompts to obtain diagnostic predictions for each patient from the LLMs and compared these predictions with the ground truth diagnoses in a random sample of 1000 patients.

Results: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT‐4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT‐4 correctly identified 1116 unique diagnoses.

Conclusion: The results suggest that artificial intelligence (AI) has the potential, when working alongside clinicians, to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, a significant number of challenges in incorporating AI into health care exist, including ethical, liability and regulatory barriers.
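As an illustration of the hit rate mentioned in the abstract (the proportion of ground truth diagnoses that a model correctly predicts), the following is a minimal sketch in Python. It is not the authors' evaluation code. The function names, the toy patient data, the pooling of diagnoses across patients, and the matching rule (exact match after lower-casing and whitespace normalisation) are all assumptions made for this example; the abstract does not state how predictions were matched to ground truth.

```python
from typing import Dict, List


def normalize(label: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match.

    Assumption: exact string matching after this normalisation. A real
    evaluation would likely need clinical judgement or code mapping instead.
    """
    return " ".join(label.lower().split())


def hit_rate(
    ground_truth: Dict[str, List[str]],
    predictions: Dict[str, List[str]],
) -> float:
    """Share of ground truth diagnoses, pooled over all patients, that were predicted."""
    total = 0
    hits = 0
    for patient_id, truth in ground_truth.items():
        truth_set = {normalize(d) for d in truth}
        # A patient with no prediction contributes misses only.
        predicted_set = {normalize(d) for d in predictions.get(patient_id, [])}
        total += len(truth_set)
        hits += len(truth_set & predicted_set)
    if total == 0:
        raise ValueError("no ground truth diagnoses to score")
    return hits / total


# Hypothetical toy data, not taken from MIMIC-IV.
toy_truth = {
    "p1": ["Type 2 diabetes", "Hypertension"],
    "p2": ["Anemia"],
    "p3": ["Sepsis"],
}
toy_predictions = {
    "p1": ["hypertension", "type 2  diabetes", "obesity"],
    "p2": ["Hypokalemia"],
}

print(hit_rate(toy_truth, toy_predictions))
```

In this sketch, extra predictions that are not in the ground truth (such as "obesity" for p1) do not lower the score, which is why the measure behaves like sensitivity rather than precision.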

Published in Health Care Science

ISSN: 2771-1749 (Print); 2771-1757 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Medicine: Public aspects of medicine
Website: https://onlinelibrary.wiley.com/journal/27711757
