Early detection of nasopharyngeal carcinoma through machine‐learning‐driven prediction model in a population‐based healthcare record database

Jeng‐Wen Chen; Shih‐Tsang Lin; Yi‐Chun Lin; Bo‐Sian Wang; Yu‐Ning Chien; Hung‐Yi Chiou

doi:10.1002/cam4.7144

Cancer Medicine (Apr 2024)

Affiliations

Jeng‐Wen Chen: Department of Otolaryngology–Head and Neck Surgery, Cardinal Tien Hospital and School of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan
Shih‐Tsang Lin: Department of Otolaryngology–Head and Neck Surgery, Cardinal Tien Hospital and School of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan
Yi‐Chun Lin: School of Public Health, Taipei Medical University, Taipei, Taiwan
Bo‐Sian Wang: Institute of Population Health Sciences, National Health Research Institutes, Miaoli, Taiwan
Yu‐Ning Chien: Department of Health and Welfare, University of Taipei, Taiwan
Hung‐Yi Chiou: School of Public Health, Taipei Medical University, Taipei, Taiwan

DOI: https://doi.org/10.1002/cam4.7144
Journal volume & issue: Vol. 13, no. 7

Abstract

Objective: Early diagnosis and treatment of nasopharyngeal carcinoma (NPC) are vital for a better prognosis. However, because the tumor arises in an obscure anatomical site and its symptoms are insidious, nearly 80% of patients with NPC are diagnosed at a late stage. This study aimed to validate a machine learning (ML) model that uses symptom‐related diagnoses and procedures in medical records to predict NPC occurrence and shorten the prediagnostic period.

Materials and Methods: Data from a population‐based health insurance database (2001–2008) were analyzed, comparing adults with and without newly diagnosed NPC. Medical records from 90 to 360 days before diagnosis were examined. Five ML algorithms (Light Gradient Boosting Machine [LGB], eXtreme Gradient Boosting [XGB], Multivariate Adaptive Regression Splines [MARS], Random Forest [RF], and Logistic Regression [LG]) were evaluated for optimal early NPC detection. The final model was further tested on real‐world data from 1 million randomly selected individuals. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC). Shapley values were used to identify the variables that contributed significantly to the predictions.

Results: LGB showed the greatest predictive power, using 14 features and records from 90 days before diagnosis. On the test dataset, the LGB model achieved an AUROC of 0.83, a specificity of 0.81, and a sensitivity of 0.64. The LGB‐driven NPC predictive tool effectively separated patients into high‐risk and low‐risk groups (hazard ratio: 5.85; 95% CI: 4.75–7.21), supporting the validity of the model's layering effect.

Conclusions: ML approaches using electronic medical records accurately predicted NPC occurrence. The risk prediction model can serve as a low‐cost digital screening tool, offering rapid medical decision support to shorten the prediagnostic period. Timely referral is crucial for high‐risk patients identified by the model.
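For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how AUROC, sensitivity, and specificity can be computed from predicted risk scores. It is not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions chosen purely for illustration, and the data are synthetic rather than taken from the study.

```python
from typing import Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    A case is flagged as high-risk when its score is >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic toy data (hypothetical; NOT from the study):
# 1 = case, 0 = control, with a made-up risk score per person.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

auc = auroc(y_true, scores)
sens, spec = sensitivity_specificity(y_true, scores, threshold=0.5)
print(f"AUROC={auc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

AUROC does not depend on a decision threshold, whereas sensitivity and specificity trade off against each other as the threshold moves, which is why the abstract reports all three.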

Published in Cancer Medicine

ISSN: 2045-7634 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Medicine: Internal medicine: Neoplasms. Tumors. Oncology. Including cancer and carcinogens
Website: https://onlinelibrary.wiley.com/journal/20457634
