PLOS Digital Health (Jun 2022)
Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis
Abstract
Computer-aided detection (CAD) was recently recommended by the WHO for TB screening and triage based on several evaluations, but unlike traditional diagnostic tests, CAD software is updated frequently and requires continual evaluation. Since then, newer versions of two of the evaluated products have already been released. We used a case-control sample of 12,890 chest X-rays to compare performance and model the programmatic effect of upgrading to newer versions of CAD4TB and qXR. We compared the area under the receiver operating characteristic curve (AUC), both overall and with data stratified by age, TB history, gender, and patient source. All versions were compared against radiologist readings and WHO’s Target Product Profile (TPP) for a TB triage test. Both newer versions significantly outperformed their predecessors in terms of AUC: CAD4TB version 6 (0.823 [0.816–0.830]) versus version 7 (0.903 [0.897–0.908]), and qXR version 2 (0.872 [0.866–0.878]) versus version 3 (0.906 [0.901–0.911]). Newer versions met WHO TPP values; older versions did not. All products equalled or surpassed human radiologist performance, with improvements in triage ability in newer versions. Humans and CAD performed worse in older age groups and among those with TB history. New versions of CAD outperform their predecessors. Prior to implementation, CAD should be evaluated using local data because underlying neural networks can differ significantly. An independent rapid evaluation centre is needed to provide implementers with performance data on new versions of CAD products as they are developed.
Author summary
The World Health Organization recommended the use of artificial intelligence (AI)-powered computer-aided detection (CAD) for TB screening and triage in 2021. One year on, we comprehensively compare the performance of the newest versions of two CAD products (CAD4TB and qXR) to their WHO-evaluated predecessors.
We found that both newer versions significantly improved upon their predecessors’ ability to detect TB, performing better than the human readers. We also showed that the AI underlying new software versions can differ remarkably from the old and resemble an entirely new product. We further demonstrate that, unlike updates to laboratory diagnostic tools, CAD software updates could significantly affect the selection of appropriate threshold scores, the number of people with TB detected, and cost-effectiveness. With newer CAD versions being rolled out almost annually, our results underscore the need for rapid evidence generation to evaluate newer CAD versions in the fast-growing medical AI industry.
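To illustrate why a version update can shift both the AUC and the appropriate threshold score, the following is a minimal sketch and not the study's analysis code. It assumes each CAD version outputs one abnormality score per X-ray and that a reference standard labels each X-ray as TB-positive or TB-negative. The function names (`auc`, `threshold_for_sensitivity`, `specificity_at`), the toy scores, and the 0.80 sensitivity target are all hypothetical and chosen only for illustration; none of them come from the study data.

```python
import math
from typing import Sequence


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Mann-Whitney estimate of AUC: the probability that a TB-positive
    X-ray scores higher than a TB-negative one (ties count as 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def threshold_for_sensitivity(scores_pos: Sequence[float], target: float) -> float:
    """Highest threshold t such that the share of TB-positive X-rays
    with score >= t is at least `target`."""
    ordered = sorted(scores_pos, reverse=True)
    # Small epsilon guards against floating-point error in target * n.
    needed = math.ceil(target * len(ordered) - 1e-9)
    return ordered[needed - 1]


def specificity_at(scores_neg: Sequence[float], threshold: float) -> float:
    """Share of TB-negative X-rays scoring below the threshold."""
    return sum(1 for s in scores_neg if s < threshold) / len(scores_neg)


# Toy scores (invented for illustration) for the same ten X-rays
# read by a hypothetical older and newer software version.
old_pos = [0.9, 0.8, 0.6, 0.55, 0.3]
old_neg = [0.7, 0.5, 0.4, 0.2, 0.1]
new_pos = [0.95, 0.9, 0.85, 0.7, 0.6]
new_neg = [0.65, 0.5, 0.3, 0.2, 0.1]

TARGET_SENSITIVITY = 0.80  # illustrative assumption, not a value from the text

for name, pos, neg in [("old", old_pos, old_neg), ("new", new_pos, new_neg)]:
    t = threshold_for_sensitivity(pos, TARGET_SENSITIVITY)
    print(name, "AUC:", auc(pos, neg), "threshold:", t,
          "specificity:", specificity_at(neg, t))
```

In this toy example the threshold that reaches the same sensitivity differs between the two versions, which is the practical reason a threshold calibrated on one version may not carry over to the next and would need to be re-derived on local data.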