Medicine Advances (Sep 2024)
Artificial intelligence in orthopaedic education: A comparative analysis of ChatGPT and Bing AI's Orthopaedic In‐Training Examination performance
Abstract
Background: This study evaluated the performance of generative artificial intelligence (AI) models on the Orthopaedic In‐Training Examination (OITE), an annual examination administered to residents in U.S. orthopaedic residency programs.
Methods: ChatGPT 3.5 and Bing AI GPT 4.0 were evaluated on standardised sets of multiple‐choice questions drawn from the American Academy of Orthopaedic Surgeons OITE online question bank spanning 5 years (2018–2022). A total of 1165 questions were posed to each AI system, with both systems evaluated using their latest available versions. Historical resident scores taken from the annual OITE technical reports were used for comparison.
Results: Across the five datasets, ChatGPT 3.5 scored an average of 55.0% on the OITE questions, whereas Bing AI GPT 4.0 scored higher, with an average of 80.0%. In comparison, the average performance of orthopaedic residents in nationally accredited programs was 62.1%. Bing AI GPT 4.0 outperformed both ChatGPT 3.5 and Accreditation Council for Graduate Medical Education examinees; analysis of variance demonstrated a significant difference among groups (p < 0.001). The best single-year performance was achieved by Bing AI GPT 4.0 on the 2020 OITE.
Conclusion: Generative AI can provide logical context for its answers through in‐depth information retrieval and citation of sources. This combination presents a convincing argument for the possible use of AI as an interactive learning aid in medical education.
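The abstract does not specify the exact ANOVA configuration; a minimal sketch of the kind of one-way comparison it describes, using hypothetical per-year scores chosen only to match the reported group averages (the actual per-exam values are in the paper's results), could look like this in Python with SciPy:

    from scipy import stats
    import numpy as np

    # Hypothetical per-exam accuracy (%) for each group across the five
    # OITE years (2018-2022); illustrative values, not the study data.
    chatgpt_35 = np.array([53.0, 56.5, 54.0, 55.8, 55.7])  # mean ~55.0%
    bing_gpt4  = np.array([78.5, 79.0, 83.0, 80.2, 79.3])  # mean ~80.0%
    residents  = np.array([61.0, 62.5, 63.0, 61.8, 62.2])  # mean ~62.1%

    # One-way ANOVA testing whether mean performance differs among the
    # three groups, mirroring the comparison reported in the abstract.
    f_stat, p_value = stats.f_oneway(chatgpt_35, bing_gpt4, residents)
    print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

A significant omnibus result like the reported p < 0.001 indicates that at least one group mean differs; identifying which pairs differ would require a post hoc test.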
Keywords