Performance of ChatGPT and Radiology Residents on Ultrasonography Board-Style Questions

Jiale Xu, MD, Shujun Xia, MD, Qing Hua, MD, Zihan Mei, MD, Yiqing Hou, MD, Minyan Wei, MD, Limei Lai, MD, Yixuan Yang, MD, Jianqiao Zhou, MD

doi:10.37015/AUDT.2024.240002

Advanced Ultrasound in Diagnosis and Therapy (Dec 2024)

Affiliations

Jiale Xu, MD, Shujun Xia, MD, Qing Hua, MD, Zihan Mei, MD, Yiqing Hou, MD, Minyan Wei, MD, Limei Lai, MD, Yixuan Yang, MD, Jianqiao Zhou, MD: (a) Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; (b) College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China

Journal volume & issue: Vol. 8, no. 4
pp. 250 – 254

Abstract

Objective: This study aims to assess the performance of the Chat Generative Pre-Trained Transformer (ChatGPT), specifically versions GPT-3.5 and GPT-4, on ultrasonography board-style questions, and to compare it with the performance of third-year radiology residents on the same set of questions.

Methods: The study was conducted from May 19 to May 30, 2023. A selection of 134 multiple-choice questions, sourced from a commercial question bank for American Registry for Diagnostic Medical Sonography (ARDMS) examinations, was entered into ChatGPT (both GPT-3.5 and GPT-4). ChatGPT's responses were evaluated overall, by topic, and by GPT version. The same question set was assigned to three third-year radiology residents, enabling a direct comparison with ChatGPT.

Results: GPT-4 correctly answered 82.1% of questions (110 of 134), significantly more than GPT-3.5 (P = 0.003), which correctly answered 66.4% of questions (89 of 134). GPT-3.5's performance was statistically indistinguishable from the average performance of the radiology residents (66.7%, 89.3 of 134) (P = 0.969), whereas GPT-4 was significantly more accurate than the residents (P = 0.004).

Conclusions: ChatGPT demonstrated considerable competency in responding to ultrasonography board-style questions, with GPT-4 markedly surpassing both its predecessor GPT-3.5 and the radiology residents.
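The arithmetic behind the reported GPT-4 versus GPT-3.5 comparison can be illustrated with a short sketch. The abstract does not state which statistical test produced its P values, so the choice below (Pearson's chi-square on a 2x2 table, without continuity correction, treating the two sets of answers as independent samples) is an assumption made only for illustration, and the function names are hypothetical. Only the counts 110, 89, and 134 come from the abstract.

```python
import math


def accuracy_percent(correct: int, total: int) -> float:
    """Return the percentage of questions answered correctly."""
    return 100.0 * correct / total


def pearson_chi_square_2x2(correct_a: int, correct_b: int, total: int):
    """Pearson chi-square test (no continuity correction) for two groups
    that each answered `total` questions.

    Assumption: the two groups are treated as independent samples. The
    abstract does not say which test the authors actually used.
    Returns (chi_square_statistic, p_value).
    """
    observed = [
        [correct_a, total - correct_a],  # group A: correct, incorrect
        [correct_b, total - correct_b],  # group B: correct, incorrect
    ]
    grand_total = 2 * total
    column_totals = [
        correct_a + correct_b,                # all correct answers
        grand_total - correct_a - correct_b,  # all incorrect answers
    ]
    chi2 = 0.0
    for row in observed:
        for j, obs in enumerate(row):
            # Expected count under the null hypothesis of equal accuracy.
            expected = total * column_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    total_questions = 134
    gpt4_correct, gpt35_correct = 110, 89  # counts from the abstract

    print(f"GPT-4 accuracy:   {accuracy_percent(gpt4_correct, total_questions):.1f}%")
    print(f"GPT-3.5 accuracy: {accuracy_percent(gpt35_correct, total_questions):.1f}%")

    chi2, p = pearson_chi_square_2x2(gpt4_correct, gpt35_correct, total_questions)
    print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

The two percentages reproduce the 82.1% and 66.4% given in the Results. Under the stated assumptions the p-value rounds to 0.003, in line with the reported figure, although this agreement does not establish which test the authors applied. Because both models answered the same questions, a paired test such as McNemar's would also be a natural choice, but it requires per-question results that the abstract does not provide.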

Keywords: artificial intelligence; ultrasonography; accuracy; medical education

Published in Advanced Ultrasound in Diagnosis and Therapy

ISSN: 2576-2508 (Print); 2576-2516 (Online)
Publisher: Editorial Office of Advanced Ultrasound in Diagnosis and Therapy
Country of publisher: China
LCC subjects: Medicine: Medicine (General): Medical technology
Website: http://www.audt.org
