Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool

Arthur Drouaud, BS; Carolina Stocchi, BS; Justin Tang, BS; Grant Gonsalves, BA; Zoe Cheung, MD; Jan Szatkowski, MD; David Forsh, MD

doi:10.2106/JBJS.OA.24.00081

JBJS Open Access (Dec 2024)

Affiliations

Arthur Drouaud, BS: 1 George Washington University School of Medicine, Washington, District of Columbia
Carolina Stocchi, BS: 2 Department of Orthopaedic Surgery, Mount Sinai, New York, New York
Justin Tang, BS: 2 Department of Orthopaedic Surgery, Mount Sinai, New York, New York
Grant Gonsalves, BA: 2 Department of Orthopaedic Surgery, Mount Sinai, New York, New York
Zoe Cheung, MD: 3 Department of Orthopaedic Surgery, Staten Island University Hospital, Staten Island, New York
Jan Szatkowski, MD: 4 Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, Indiana
David Forsh, MD: 2 Department of Orthopaedic Surgery, Mount Sinai, New York, New York

DOI: https://doi.org/10.2106/JBJS.OA.24.00081
Journal volume & issue: Vol. 9, no. 4

Abstract

Introduction: We assessed the performance of ChatGPT-4 vision (GPT-4V) in image interpretation, diagnosis formulation, and patient management. We aimed to shed light on its potential as an educational tool for medical students working through real-life cases.

Methods: Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted medical imaging and patient information, provided diagnoses, and guided responses to OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V's responses on a 5-point Likert scale (strongly disagree to strongly agree). Each answer was assessed for alignment with current medical knowledge (accuracy), whether its reasoning was logical (rationale), relevance to the specific case (relevance), and whether the surgeons would trust the answer (trustworthiness). Mean scores were calculated from the surgeon ratings.

Results: In total, 10 clinical cases comprising 97 questions were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received a mean overall score of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions received a mean overall score of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93).

Conclusion: This is the first study evaluating GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicate overall fair agreement with GPT-4V's reasoning behind decision-making. GPT-4V performed less favorably in imaging interpretation than in management and treatment. The performance of GPT-4V falls below the standards of our fellowship-trained orthopaedic trauma surgeons as a standalone tool for medical education.
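
The abstract does not state how each overall score was derived from the four criterion scores. As an illustrative check only, the sketch below assumes the overall score is the unweighted mean of the four criterion means reported in the Results. The function name overall_score and the dictionary layout are hypothetical and are not taken from the study.

```python
from statistics import mean

# Criterion means as reported in the abstract, by question category.
reported = {
    "imaging": {"accuracy": 3.28, "rationale": 3.68,
                "relevance": 3.75, "trustworthiness": 3.15},
    "management": {"accuracy": 3.61, "rationale": 3.84,
                   "relevance": 4.01, "trustworthiness": 3.58},
    "treatment": {"accuracy": 3.99, "rationale": 4.08,
                  "relevance": 4.15, "trustworthiness": 3.93},
}

# Overall scores as reported in the abstract.
reported_overall = {"imaging": 3.46, "management": 3.76, "treatment": 4.04}

# Number of questions per category as reported in the abstract.
question_counts = {"imaging": 10, "management": 35, "treatment": 52}


def overall_score(criteria):
    """Unweighted mean of the criterion means.

    Assumption: the abstract does not say how the overall score was
    computed; equal weighting is used here purely for illustration.
    """
    return mean(list(criteria.values()))


computed = {category: overall_score(c) for category, c in reported.items()}

for category, value in computed.items():
    print(f"{category}: computed {value:.3f}, "
          f"reported {reported_overall[category]:.2f}")
```

Under this assumption, each computed value lies within 0.01 of the reported overall score, which is consistent with equal weighting of the four criteria but does not confirm that this was the authors' method.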

Published in JBJS Open Access

ISSN: 2472-7245 (Online)
Publisher: Wolters Kluwer
Country of publisher: United States
LCC subjects: Medicine: Surgery: Orthopedic surgery
Website: http://journals.lww.com/jbjsoa
