Validating the accuracy of deep learning for the diagnosis of pneumonia on chest x-ray against a robust multimodal reference diagnosis: a post hoc analysis of two prospective studies

Jeremy Hofmeister; Nicolas Garin; Xavier Montet; Max Scheffler; Alexandra Platon; Pierre-Alexandre Poletti; Jérôme Stirnemann; Marie-Pierre Debray; Yann-Erick Claessens; Xavier Duval; Virginie Prendki

doi:10.1186/s41747-023-00416-y

European Radiology Experimental (Feb 2024)

Affiliations

Jeremy Hofmeister: Department of Diagnostics, Geneva University Hospitals
Nicolas Garin: Division of Internal Medicine, Riviera Chablais Hospital
Xavier Montet: Department of Diagnostics, Geneva University Hospitals
Max Scheffler: Department of Diagnostics, Geneva University Hospitals
Alexandra Platon: Department of Diagnostics, Geneva University Hospitals
Pierre-Alexandre Poletti: Department of Diagnostics, Geneva University Hospitals
Jérôme Stirnemann: Department of Medicine, Geneva University Hospitals
Marie-Pierre Debray: Department of Radiology, APHP, Hôpital Bichat, University Paris Cité
Yann-Erick Claessens: Department of Emergency Medicine, Centre Hospitalier Princesse Grace
Xavier Duval: Department of Epidemiology and Clinical Research, Inserm CIC 1425, UMR 1138, APHP, Hôpital Bichat, University Paris Cité, IAME
Virginie Prendki: Department of Rehabilitation and Geriatrics, Geneva University Hospitals

DOI: https://doi.org/10.1186/s41747-023-00416-y
Journal volume & issue: Vol. 8, no. 1
pp. 1 – 10

Abstract

Background: Artificial intelligence (AI) seems promising for diagnosing pneumonia on chest x-rays (CXR), but deep learning (DL) algorithms have primarily been compared with radiologists, whose diagnosis is not always accurate. We therefore evaluated the accuracy of DL in diagnosing pneumonia on CXR against a more robust reference diagnosis.

Methods: We trained a DL convolutional neural network model to diagnose pneumonia and evaluated its accuracy in two prospective pneumonia cohorts including 430 patients, for whom the reference diagnosis was determined a posteriori by a multidisciplinary expert panel using multimodal data. The performance of the DL model was compared with that of senior radiologists and emergency physicians reviewing CXRs, and with that of radiologists reviewing computed tomography (CT) performed concomitantly.

Results: Radiologists and DL showed similar accuracy on CXR in both cohorts (p ≥ 0.269): cohort 1, radiologist 1 75.5% (95% confidence interval 69.1–80.9), radiologist 2 71.0% (64.4–76.8), DL 71.0% (64.4–76.8); cohort 2, radiologist 70.9% (64.7–76.4), DL 72.6% (66.5–78.0). The accuracy of radiologists and DL was significantly higher (p ≤ 0.022) than that of emergency physicians (cohort 1 64.0% [57.1–70.3], cohort 2 63.0% [55.6–69.0]). Accuracy was significantly higher for CT (cohort 1 79.0% [72.8–84.1], cohort 2 89.6% [84.9–92.9]) than for all CXR readers, including radiologists, clinicians, and DL (all p-values < 0.001).

Conclusions: When compared with a robust reference diagnosis, the performance of AI models in identifying pneumonia on CXRs was lower than previously reported, but similar to that of radiologists and better than that of emergency physicians.

Relevance statement: The clinical relevance of AI models for pneumonia diagnosis may have been overestimated. AI models should be benchmarked against a robust multimodal reference diagnosis to avoid overestimating their performance.

Trial registration: NCT02467192 and NCT01574066.

Key points
• We evaluated an open-access convolutional neural network (CNN) model to diagnose pneumonia on CXRs.
• The CNN was validated against a strong multimodal reference diagnosis.
• In our study, the CNN performance (area under the receiver operating characteristic curve 0.74) was lower than that previously reported when validated against radiologists’ diagnosis (0.99 in a recent meta-analysis).
• The CNN performance was significantly higher than that of emergency physicians (p ≤ 0.022) and comparable to that of board-certified radiologists (p ≥ 0.269).
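The accuracies above are reported with 95% confidence intervals. As a minimal sketch of how such an interval can be obtained for a proportion, the following Python function computes a Wilson score interval. The abstract does not state which interval method the authors used, nor the size of each cohort, so the choice of the Wilson method, the function name `wilson_interval`, and the example counts (151 correct readings out of 200) are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a proportion (here: diagnostic accuracy).

    Illustrative only: the source does not say which method was used.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 151 correct out of 200 gives an accuracy of 75.5%.
low, high = wilson_interval(151, 200)
print(f"accuracy 75.5%, 95% CI {100 * low:.1f}-{100 * high:.1f}")
```

Under these assumed counts the sketch yields an interval of about 69.1 to 80.9, the same range reported for radiologist 1 in cohort 1, which suggests (but does not confirm) that a score-type interval on roughly 200 patients was used.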

Published in European Radiology Experimental

ISSN: 2509-9280 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Medical physics. Medical radiology. Nuclear medicine
Website: https://eurradiolexp.springeropen.com/
