ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source

Diane Ghanem, MD; Alexander R. Zhu, BA; Whitney Kagabo, MD; Greg Osgood, MD, FAOA; Babar Shafiq, MD, FAOA

doi:10.2106/JBJS.OA.24.00099

JBJS Open Access (Sep 2024)

Affiliations

Diane Ghanem, MD: 1 Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
Alexander R. Zhu, BA: 2 School of Medicine, The Johns Hopkins University, Baltimore, Maryland
Whitney Kagabo, MD: 1 Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
Greg Osgood, MD, FAOA: 1 Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
Babar Shafiq, MD, FAOA: 1 Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland

DOI: https://doi.org/10.2106/JBJS.OA.24.00099
Journal volume & issue: Vol. 9, no. 3

Abstract

Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet, the accuracy of the references behind the provided information remains elusive, which poses a concern for maintaining the integrity of medical content. This study aims to examine the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.

Methods: Two independent reviewers critically assessed 30 ChatGPT-4–generated references supporting the well-established ABCDE approach to trauma protocol, grading them as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's Kappa coefficient was used to examine the agreement of the accuracy scores of the ChatGPT-4–generated references between reviewers. Descriptive statistics were used to summarize the mean reference accuracy scores. To compare the variance of the means across the 5 categories, one-way analysis of variance was used.

Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed “true” while 56.7% were categorized as “false” (43.3% inaccurate and 13.3% nonexistent). The accuracy was consistent across the 5 trauma protocol categories, with no significant statistical difference (p = 0.437).

Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 for professional medical decision making without thorough verification. Only if used cautiously, with cross-referencing, can this language model act as an adjunct learning tool that can enhance comprehensiveness as well as knowledge rehearsal and manipulation.
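The agreement statistic named in the Methods can be illustrated with a minimal Python sketch. The two rating lists below are hypothetical and are not the study's data; the function names `cohen_kappa` and `accuracy_percent` are illustrative, and expressing the mean 0 to 2 score as a percentage of the maximum score is an assumption about how a percentage accuracy score could be derived, not a description confirmed by the abstract.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same grade by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed per grade.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical grade throughout
    return (observed - expected) / (1.0 - expected)


def accuracy_percent(scores, max_score=2):
    """Mean score as a percentage of the maximum (assumed scoring scheme)."""
    return 100.0 * sum(scores) / (max_score * len(scores))


# Hypothetical grades: 0 = nonexistent, 1 = inaccurate, 2 = accurate.
reviewer_a = [2, 2, 1, 0, 1, 2, 1, 1, 2, 0]
reviewer_b = [2, 2, 1, 0, 1, 2, 1, 2, 2, 0]

kappa = cohen_kappa(reviewer_a, reviewer_b)
score = accuracy_percent(reviewer_a)
print(f"kappa = {kappa:.3f}, accuracy score = {score:.1f}%")
```

With these made-up grades the reviewers disagree on one reference out of ten, which gives a kappa of about 0.84 once chance agreement is subtracted.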

Published in JBJS Open Access

ISSN: 2472-7245 (Online)
Publisher: Wolters Kluwer
Country of publisher: United States
LCC subjects: Medicine: Surgery: Orthopedic surgery
Website: http://journals.lww.com/jbjsoa