Learning Health Systems (Jul 2024)

Diagnostic accuracy of GPT‐4 on common clinical scenarios and challenging cases

  • Geoffrey W. Rutledge

DOI
https://doi.org/10.1002/lrh2.10438
Journal volume & issue
Vol. 8, no. 3

Abstract

Introduction: Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases.

Methods: We compared the accuracy of GPT‐4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases.

Results: For a set of previously unpublished challenging clinical cases, GPT‐4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT‐4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians.

Conclusions: GPT‐4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT‐4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.

Keywords