Are ChatGPT’s Free-Text Responses on Periprosthetic Joint Infections of the Hip and Knee Reliable and Useful?

Alexander Draschl; Georg Hauer; Stefan Franz Fischerauer; Angelika Kogler; Lukas Leitner; Dimosthenis Andreou; Andreas Leithner; Patrick Sadoghi

doi:10.3390/jcm12206655

Journal of Clinical Medicine (Oct 2023)

Are ChatGPT’s Free-Text Responses on Periprosthetic Joint Infections of the Hip and Knee Reliable and Useful?

Alexander Draschl,
Georg Hauer,
Stefan Franz Fischerauer,
Angelika Kogler,
Lukas Leitner,
Dimosthenis Andreou,
Andreas Leithner,
Patrick Sadoghi

Affiliations

Alexander Draschl: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Georg Hauer: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Stefan Franz Fischerauer: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Angelika Kogler: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Lukas Leitner: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Dimosthenis Andreou: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Andreas Leithner: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria
Patrick Sadoghi: Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria

DOI: https://doi.org/10.3390/jcm12206655
Journal volume & issue: Vol. 12, no. 20
p. 6655

Abstract

Read online

Background: This study aimed to evaluate ChatGPT’s performance on questions about periprosthetic joint infections (PJI) of the hip and knee. Methods: Twenty-seven questions from the 2018 International Consensus Meeting on Musculoskeletal Infection were selected for response generation. The free-text responses were evaluated by three orthopedic surgeons using a five-point Likert scale. Inter-rater reliability (IRR) was assessed via Fleiss’ kappa (FK). Results: Overall, near-perfect IRR was found for disagreement on the presence of factual errors (FK: 0.880, 95% CI [0.724, 1.035], p p p p p p < 0.001). Question- and subtopic-specific analysis revealed diverse IRR levels ranging from near-perfect to poor. Conclusions: ChatGPT’s free-text responses to complex orthopedic questions were predominantly reliable and useful for orthopedic surgeons and patients. Given variations in performance by question and subtopic, consulting additional sources and exercising careful interpretation should be emphasized for reliable medical decision-making.

Published in Journal of Clinical Medicine

ISSN: 2077-0383 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Medicine
Website: http://www.mdpi.com/journal/jcm

About the journal

Abstract

Keywords