Advanced Ultrasound in Diagnosis and Therapy (Dec 2024)

Performance of ChatGPT and Radiology Residents on Ultrasonography Board-Style Questions

  • Jiale Xu, MD, Shujun Xia, MD, Qing Hua, MD, Zihan Mei, MD, Yiqing Hou, MD, Minyan Wei, MD, Limei Lai, MD, Yixuan Yang, MD, Jianqiao Zhou, MD

DOI
https://doi.org/10.37015/AUDT.2024.240002
Journal volume & issue
Vol. 8, no. 4
pp. 250 – 254

Abstract


Objective: This study aims to assess the performance of the Chat Generative Pre-Trained Transformer (ChatGPT), specifically versions GPT-3.5 and GPT-4, on ultrasonography board-style questions, and to compare it with the performance of third-year radiology residents on the identical set of questions.

Methods: The study, conducted from May 19 to May 30, 2023, used 134 multiple-choice questions sourced from a commercial question bank for American Registry for Diagnostic Medical Sonography (ARDMS) examinations and entered into ChatGPT (both GPT-3.5 and GPT-4 versions). ChatGPT's responses were evaluated overall, by topic, and by GPT version. The identical question set was assigned to three third-year radiology residents, enabling a direct comparison of their performance with ChatGPT's.

Results: GPT-4 correctly answered 82.1% of questions (110 of 134), significantly surpassing GPT-3.5 (P = 0.003), which correctly answered 66.4% of questions (89 of 134). Although GPT-3.5's performance was statistically indistinguishable from the average performance of the radiology residents (66.7%, 89.3 of 134) (P = 0.969), there was a notable difference in question-answering accuracy between GPT-4 and the residents (P = 0.004).

Conclusions: ChatGPT demonstrated significant competency in responding to ultrasonography board-style questions, with the GPT-4 version markedly surpassing both its predecessor GPT-3.5 and the radiology residents.

Keywords