npj Digital Medicine (Nov 2024)

Interpretable machine learning model for digital lung cancer prescreening in Chinese populations with missing data

  • Shuaijie Zhang,
  • Qing Wang,
  • Xifeng Hu,
  • Botao Zhang,
  • Shuangshuang Sun,
  • Ying Yuan,
  • Xiaofeng Jia,
  • Yuanyuan Yu,
  • Fuzhong Xue

DOI
https://doi.org/10.1038/s41746-024-01309-z
Journal volume & issue
Vol. 7, no. 1
pp. 1 – 14

Abstract

We developed an interpretable model, BOUND (Bayesian netwOrk for large-scale lUng caNcer Digital prescreening), using a comprehensive EHR dataset from China to improve lung cancer detection rates. BOUND employs Bayesian network uncertainty inference, allowing it to predict lung cancer risk even with missing data and to identify high-risk factors. Developed using data from 905,194 individuals, BOUND achieved an AUC of 0.866 in internal validation, with time- and geography-based external validations yielding AUCs of 0.848 and 0.841, respectively. In datasets with 10%–70% missing data, the AUC ranged from 0.827 to 0.746. The model demonstrates strong calibration, clinical utility, and robust performance in both balanced and imbalanced datasets. A risk scorecard that improves detection rates by up to 6.8 times was also created and is freely available online ( https://drzhang1.aiself.net/ ). BOUND enables non-radiative, cost-effective lung cancer prescreening, excels with missing data, and addresses treatment inequities in resource-limited primary healthcare settings.
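The idea of predicting risk from incomplete records can be illustrated with a toy example. The sketch below is not the BOUND model: the four binary variables, the network structure, and every probability are hypothetical values chosen only for illustration. It shows the general principle that a Bayesian network can compute a posterior risk by summing over all values of whichever variables are unobserved, so a missing field needs no imputation step.

```python
from itertools import product

# Hypothetical toy network (NOT the BOUND structure or its parameters):
#   smoker -> cancer <- older,  cancer -> cough
# All variables are binary (0/1) and all probabilities are made up.
P_SMOKER = 0.3                      # P(smoker = 1)
P_OLDER = 0.4                       # P(older = 1)
P_CANCER = {                        # P(cancer = 1 | smoker, older)
    (0, 0): 0.01, (0, 1): 0.03,
    (1, 0): 0.05, (1, 1): 0.15,
}
P_COUGH = {0: 0.1, 1: 0.5}          # P(cough = 1 | cancer)

NAMES = ("smoker", "older", "cancer", "cough")


def bern(p, value):
    """Probability that a binary variable with P(1) = p takes `value`."""
    return p if value == 1 else 1.0 - p


def joint(smoker, older, cancer, cough):
    """Joint probability of one full assignment, factorised along the graph."""
    return (bern(P_SMOKER, smoker)
            * bern(P_OLDER, older)
            * bern(P_CANCER[(smoker, older)], cancer)
            * bern(P_COUGH[cancer], cough))


def cancer_risk(evidence):
    """P(cancer = 1 | evidence) by exhaustive enumeration.

    `evidence` maps observed variable names to 0/1. Any variable left out
    is treated as missing and is marginalised (summed) out.
    """
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        # Skip assignments that contradict what was actually observed.
        if any(world[name] != seen for name, seen in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if world["cancer"] == 1:
            numerator += p
    return numerator / denominator


# Risk with a complete risk-factor record, then with "older" missing.
print(cancer_risk({"smoker": 1, "older": 1}))
print(cancer_risk({"smoker": 1}))
print(cancer_risk({"smoker": 1, "cough": 1}))
```

Exhaustive enumeration is only practical for a handful of variables. A network at the scale described in the abstract would need a more efficient exact or approximate inference algorithm, which this sketch does not attempt to reproduce.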