An assessment of ChatGPT in error detection for thyroid ultrasound reports: A comparative study with ultrasound physicians

Zhirong Xu; Jiayi Ye; Weiwen Luo; Lina Han; Hui Yin; Yanru Li; Qichen Su; Shanshan Su; Guorong Lyu; Shaohui Li

doi:10.1177/20552076251326019

Digital Health (Mar 2025)

Affiliations

Zhirong Xu: Department of Ultrasound, Quanzhou, China
Jiayi Ye: Department of Nuclear Medicine, Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
Weiwen Luo: Department of Medical Ultrasound, Zhangzhou Municipal Hospital of Fujian Province and Zhangzhou Affiliated Hospital of Fujian Medical University, Zhangzhou, China
Lina Han: Department of Ultrasound, Quanzhou, China
Hui Yin: Department of Ultrasound, Quanzhou, China
Yanru Li: Department of Ultrasound, Quanzhou, China
Qichen Su: Department of Ultrasound, Quanzhou, China
Shanshan Su: Department of Ultrasound, Quanzhou, China
Guorong Lyu: Department of Ultrasound, Quanzhou, China
Shaohui Li: Department of Ultrasound, Quanzhou, China

DOI: https://doi.org/10.1177/20552076251326019
Journal volume & issue: Vol. 11

Abstract

Background: This study evaluates the performance of GPT-4o in detecting errors in ACR TIRADS ultrasound reports and its potential to reduce report generation time.

Methods: A retrospective analysis of 200 thyroid ultrasound reports from the Second Affiliated Hospital of Fujian Medical University was conducted, with reports categorized as correct or containing up to three errors. GPT-4o's performance was compared with ultrasound physicians of varying experience levels in error detection and processing time.

Results: GPT-4o detected 90.0% (180/200) of errors, slightly less than the best-performing senior ultrasound physician's 93.0% (186/200) with no significant difference (p = 0.281). GPT-4o's error detection rate was comparable to that of ultrasound physicians overall (p = 0.098 to 0.866). It outperformed Resident 2 in diagnostic errors (87% vs. 69%). Reader agreement was low (Cohen's kappa = 0 to 0.31). GPT-4o reviewed reports significantly faster than all ultrasound physicians (0.79 vs. 1.8 to 3.1 h, p < 0.001), making it a reliable and efficient tool for error detection in medical imaging.

Conclusions: GPT-4o is comparable to experienced ultrasound physicians in error detection and significantly improves report processing efficiency, offering a valuable tool for enhancing diagnostic accuracy and aiding junior residents.
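The headline comparison in the Results (180/200 vs. 186/200 errors detected) can be reproduced approximately with a short calculation. The abstract does not say which statistical test produced p = 0.281, so the sketch below assumes an uncorrected Pearson chi-square test on a 2x2 table that treats the two readers as independent samples. Because both readers reviewed the same reports, a paired test may have been used instead. The function names are illustrative and do not come from the paper.

```python
import math

def detection_rate(detected, total):
    """Fraction of errors detected."""
    return detected / total

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Assumption: the two rows are independent samples. Returns (chi2, p),
    where p is the upper-tail probability for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Counts taken from the abstract: detected vs. missed errors per reader.
gpt_rate = detection_rate(180, 200)      # GPT-4o
senior_rate = detection_rate(186, 200)   # best-performing senior physician
chi2, p_value = chi_square_2x2(180, 20, 186, 14)

print(f"GPT-4o: {gpt_rate:.1%}, senior physician: {senior_rate:.1%}")
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumptions the p-value lands close to the reported 0.281, which is consistent with the statement that the difference is not significant at the conventional 0.05 level.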

Published in Digital Health

ISSN: 2055-2076 (Online)
Publisher: SAGE Publishing
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://journals.sagepub.com/home/dhj
