Interpretable machine learning model for digital lung cancer prescreening in Chinese populations with missing data

Shuaijie Zhang; Qing Wang; Xifeng Hu; Botao Zhang; Shuangshuang Sun; Ying Yuan; Xiaofeng Jia; Yuanyuan Yu; Fuzhong Xue

doi:10.1038/s41746-024-01309-z

npj Digital Medicine (Nov 2024)

Affiliations

Shuaijie Zhang: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Qing Wang: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Xifeng Hu: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Botao Zhang: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Shuangshuang Sun: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Ying Yuan: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University
Xiaofeng Jia: Health and Wellness Assurance Center Network Information Office of Boxing County
Yuanyuan Yu: Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University
Fuzhong Xue: Department of Epidemiology and Health Statistics, School of Public Health, Cheeloo College of Medicine, Shandong University

DOI: https://doi.org/10.1038/s41746-024-01309-z
Journal volume & issue: Vol. 7, no. 1, pp. 1–14

Abstract

We developed an interpretable model, BOUND (Bayesian netwOrk for large-scale lUng caNcer Digital prescreening), using a comprehensive EHR dataset from China to improve lung cancer detection rates. BOUND employs Bayesian network uncertainty inference, which allows it to predict lung cancer risk even when data are missing and to identify high-risk factors. Developed using data from 905,194 individuals, BOUND achieved an AUC of 0.866 in internal validation, and time- and geography-based external validations yielded AUCs of 0.848 and 0.841, respectively. In datasets with 10%–70% missing data, the AUC ranged from 0.827 down to 0.746. The model demonstrates strong calibration, clinical utility, and robust performance in both balanced and imbalanced datasets. A risk scorecard, available free online (https://drzhang1.aiself.net/), was also created and improved detection rates by up to 6.8 times. BOUND enables non-radiative, cost-effective lung cancer prescreening, excels with missing data, and addresses treatment inequities in resource-limited primary healthcare settings.
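The abstract states that Bayesian network inference lets the model produce a risk estimate when some inputs are missing. The general idea can be illustrated with a minimal sketch: unobserved variables are summed out of the joint distribution rather than imputed. The three-node network below (smoker, cancer, cough), its structure, and all probability values are hypothetical and chosen only for illustration; they are not taken from BOUND, and exact enumeration is used here as a simple stand-in for whatever inference procedure the authors employ.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative values, not from BOUND).
P_SMOKER = {1: 0.3, 0: 0.7}                    # P(smoker)
P_CANCER_GIVEN_SMOKER = {1: 0.02, 0: 0.005}    # P(cancer=1 | smoker)
P_COUGH_GIVEN_CANCER = {1: 0.6, 0: 0.1}        # P(cough=1 | cancer)


def joint(smoker, cancer, cough):
    """Joint probability of one full assignment: P(S) * P(C | S) * P(K | C)."""
    p = P_SMOKER[smoker]
    p_cancer = P_CANCER_GIVEN_SMOKER[smoker]
    p *= p_cancer if cancer else 1 - p_cancer
    p_cough = P_COUGH_GIVEN_CANCER[cancer]
    p *= p_cough if cough else 1 - p_cough
    return p


def cancer_risk(smoker=None, cough=None):
    """P(cancer=1 | observed values); any input left as None is summed out."""
    smoker_values = (0, 1) if smoker is None else (smoker,)
    cough_values = (0, 1) if cough is None else (cough,)
    numerator = denominator = 0.0
    for s, c, k in product(smoker_values, (0, 1), cough_values):
        p = joint(s, c, k)
        denominator += p
        if c == 1:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(cancer_risk(smoker=1, cough=1))  # both inputs observed
    print(cancer_risk(cough=1))            # smoking status missing
    print(cancer_risk())                   # nothing observed: marginal risk
```

In this toy setting, a record with a missing smoking status still yields a valid posterior; the estimate simply carries less information than one computed from the complete record.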

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
