Heliyon (Jun 2024)
ChatGPT achieves comparable accuracy to specialist physicians in predicting the efficacy of high-flow oxygen therapy
Abstract
Background: Failure of high-flow nasal cannula (HFNC) oxygen therapy can necessitate endotracheal intubation, so timely prediction of the intubation risk following HFNC therapy is crucial for reducing mortality caused by delayed intubation.
Objectives: To investigate the accuracy of ChatGPT in predicting the risk of endotracheal intubation within 48 h of HFNC therapy and to compare it with the predictive accuracy of specialist and non-specialist physicians.
Methods: We conducted a prospective multicenter cohort study based on data from 71 adult patients who received HFNC therapy. For each patient, baseline data and physiological parameters after 6 h of HFNC therapy were recorded to create a six-alternative forced-choice questionnaire that asked participants to predict the 48-h endotracheal intubation risk on a scale from 1 to 6, with higher scores indicating greater risk. GPT-3.5, GPT-4.0, respiratory and critical care specialist physicians, and non-specialist physicians each completed the same questionnaires (N = 71). We then determined the optimal diagnostic cutoff point for each predictor and for the 6-h ROX index using the Youden index, and compared their predictive performance using receiver operating characteristic (ROC) analysis.
Results: The optimal diagnostic cutoff point was ≥4 for both GPT-4.0 and specialist physicians. At this cutoff, GPT-4.0 achieved an accuracy of 76.1%, with a specificity of 78.6% (95% CI = 52.4–92.4%) and a sensitivity of 75.4% (95% CI = 62.9–84.8%). In comparison, specialist physicians achieved an accuracy of 80.3%, with a specificity of 71.4% (95% CI = 45.4–88.3%) and a sensitivity of 82.5% (95% CI = 70.6–90.2%). For GPT-3.5 and non-specialist physicians, the optimal diagnostic cutoff point was ≥5, with accuracies of 73.2% and 64.8%, respectively. The area under the curve (AUC) in ROC analysis for GPT-4.0 was 0.821 (95% CI = 0.698–0.943), the highest among the predictors and significantly higher than that of non-specialist physicians [0.662 (95% CI = 0.518–0.805), P = 0.011].
Conclusion: GPT-4.0 achieves accuracy comparable to that of specialist physicians in predicting the 48-h endotracheal intubation risk following HFNC therapy, based on patient baseline data and physiological parameters after 6 h of HFNC therapy.
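The cutoff selection and ROC comparison described in the Methods can be illustrated with a minimal sketch. It assumes a rule of the form "score ≥ cutoff predicts intubation" and picks the cutoff that maximizes the Youden index J = sensitivity + specificity − 1; the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function names and the ten-patient toy data are hypothetical and are not taken from the study.

```python
def youden_cutoff(scores, outcomes, levels=range(1, 7)):
    """Return (cutoff, sensitivity, specificity, J) maximizing the Youden index.

    A case is predicted positive when its score is >= cutoff.
    outcomes: 1 = event occurred, 0 = event did not occur.
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best = None
    for cutoff in levels:
        tp = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s < cutoff)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[3]:
            best = (cutoff, sensitivity, specificity, j)
    return best


def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: risk ratings on the 1-6 scale and observed outcomes.
toy_scores = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
toy_outcomes = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

cutoff, sens, spec, j = youden_cutoff(toy_scores, toy_outcomes)
toy_auc = auc(toy_scores, toy_outcomes)
print(f"cutoff >= {cutoff}: sensitivity={sens:.3f}, specificity={spec:.3f}, J={j:.3f}")
print(f"AUC = {toy_auc:.3f}")
```

Applying the same procedure to each rater's 71 ratings would yield one cutoff and one AUC per rater. The confidence intervals and the P value reported in the abstract require additional statistical methods that this sketch does not cover.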