Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases

Takanobu Hirosawa; Yukinori Harada; Kazuya Mizuta; Tetsu Sakamoto; Kazuki Tokumasu; Taro Shimizu

doi:10.2196/59267

JMIR Formative Research (Jun 2024)

DOI: https://doi.org/10.2196/59267
Journal volume & issue: Vol. 8
p. e59267

Abstract

Background: The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.

Objective: This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series.

Methods: We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was focused on whether GPT-4’s evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.

Results: The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4’s evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians’ evaluations.

Conclusions: GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.
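
As a rough illustration of the two agreement statistics reported in the Results, the sketch below computes raw percent agreement and Cohen κ for two raters. The function names and the toy ratings are hypothetical and are not taken from the study. The abstract does not give the per-category counts, so the reported κ of 0.63 cannot be reproduced here; only the 966/1176 percentage can be checked directly.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions,
    # summed over all labels either rater used.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Percentage stated in the abstract: 966 concordant lists out of 1176.
    print(f"Reported agreement: {100 * 966 / 1176:.1f}%")

    # Hypothetical toy ratings (1 = final diagnosis judged to be in the list).
    model_ratings = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
    physician_ratings = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
    print(f"Toy agreement: {percent_agreement(model_ratings, physician_ratings):.2f}")
    print(f"Toy kappa: {cohen_kappa(model_ratings, physician_ratings):.2f}")
```

Because κ subtracts the agreement expected by chance, it is lower than the raw percentage whenever both raters favor the same label, which is consistent with 82.1% raw agreement corresponding to a κ of 0.63.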

Published in JMIR Formative Research

ISSN: 2561-326X (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine
Website: https://formative.jmir.org/
