AI-Based CXR First Reading: Current Limitations to Ensure Practical Value
Yuriy Vasilev,
Anton Vladzymyrskyy,
Olga Omelyanskaya,
Ivan Blokhin,
Yury Kirpichev,
Kirill Arzamasov
Affiliations
All authors: State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department”, Petrovka Street, 24, Building 1, 127051 Moscow, Russia
We performed a multicenter external evaluation of the practical and clinical efficacy of a commercial AI algorithm for chest X-ray (CXR) analysis (Lunit INSIGHT CXR). The retrospective evaluation took the form of a multi-reader study. For the prospective evaluation, the AI model was run on CXR studies and its results were compared with the reports of 226 radiologists. In the multi-reader study, the area under the ROC curve (AUC), sensitivity, and specificity of the AI were 0.94 (CI95%: 0.87–1.0), 0.90 (CI95%: 0.79–1.0), and 0.89 (CI95%: 0.79–0.98), respectively; for the radiologists, the corresponding values were 0.97 (CI95%: 0.94–1.0), 0.90 (CI95%: 0.79–1.0), and 0.95 (CI95%: 0.89–1.0). Over most of the ROC curve, the AI performed slightly worse than, or on par with, the average human reader, and the McNemar test showed no statistically significant difference between the AI and the radiologists. In the prospective study of 4752 cases, the AUC, sensitivity, and specificity of the AI were 0.84 (CI95%: 0.82–0.86), 0.77 (CI95%: 0.73–0.80), and 0.81 (CI95%: 0.80–0.82), respectively. The lower accuracy observed during prospective validation was mainly associated with false-positive findings that the experts considered clinically insignificant and with false-negative omissions of human-reported “opacity”, “nodule”, and “calcification” findings. Thus, in a large-scale prospective validation of a commercial AI algorithm in clinical practice, sensitivity and specificity were lower than in the prior retrospective evaluation of data from the same population.
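The headline statistics above (AUC, sensitivity, and specificity with 95% confidence intervals, plus a McNemar test on paired AI and radiologist reads) can be computed with standard tools. As a minimal, hypothetical sketch only, the Python snippet below evaluates synthetic labels and scores; the data, the 0.5 operating threshold, and the simulated reader accuracy are illustrative assumptions, not the study's actual analysis pipeline, which is not published here.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sens_spec(y_true, y_pred):
    """Sensitivity (TPR) and specificity (TNR) from binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_ci(stat_fn, y_true, y_hat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI; resamples cases with replacement."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y_true[idx].min() == y_true[idx].max():  # need both classes
            continue
        stats.append(stat_fn(y_true[idx], y_hat[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar chi-square (continuity-corrected) on discordant pairs."""
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = np.sum(a_ok & ~b_ok)  # A correct where B is wrong
    c = np.sum(~a_ok & b_ok)  # B correct where A is wrong
    stat = (abs(int(b) - int(c)) - 1) ** 2 / (b + c) if b + c else 0.0
    return stat, chi2.sf(stat, df=1)

# Synthetic stand-ins for ground truth, AI scores, and one reader's calls.
y = rng.integers(0, 2, 500)
ai_score = np.clip(0.6 * y + rng.normal(0.3, 0.25, 500), 0.0, 1.0)
ai_pred = (ai_score >= 0.5).astype(int)               # illustrative cutoff
rad_pred = np.where(rng.random(500) < 0.9, y, 1 - y)  # ~90%-accurate reader

lo, hi = bootstrap_ci(roc_auc_score, y, ai_score)
se, sp = sens_spec(y, ai_pred)
print(f"AI AUC  = {roc_auc_score(y, ai_score):.2f} (CI95%: {lo:.2f}-{hi:.2f})")
print(f"AI sens = {se:.2f}, spec = {sp:.2f}")
print("McNemar chi2 = %.3f, p = %.3f" % mcnemar_test(y, ai_pred, rad_pred))
```

The percentile bootstrap shown here is one common way to interval-estimate these metrics; analytic alternatives (e.g., the DeLong method for AUC) would fit equally well, and the abstract does not state which approach the authors used.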