The Lancet Regional Health. Western Pacific (Feb 2025)

Language model-based early detection of colorectal cancer occurrence from previous disease diagnosis trajectory data for primary care purpose

  • Zhiyao Luo,
  • Jiannan Yang,
  • Tingting Zhu,
  • William Chi Wai Wong,
  • Jiandong Zhou

DOI
https://doi.org/10.1016/j.lanwpc.2024.101381
Journal volume & issue
Vol. 55
p. 101381

Abstract

The early prediction of colorectal cancer (CRC) in general patients is critical for improving outcomes through timely diagnosis and intervention. However, accurately predicting CRC is challenging due to the substantial variability in disease progression and the heterogeneity in patient presentations. In this study, we propose a non-invasive, accurate approach for predicting early-stage CRC using language models applied to longitudinal electronic health records (EHR) data.

Our study evaluates the effectiveness of multiple language models, including Transformer, Recurrent Neural Network (RNN), and Multi-Layer Perceptron (MLP), in predicting CRC occurrence. The models were trained using a two-stage process of pretraining and finetuning. For model development, we used a dataset comprising the diagnostic history of 136,801 patients (45.6% male, median age 65.8 years [IQR: 52.5-76.0]) admitted between January 1, 2000, and December 31, 2003, with follow-up data through December 31, 2023. A five-fold cross-validation approach was employed, with 80% of the data used for model training and 20% reserved for evaluation. This training process was repeated five times to mitigate potential overfitting. To gain insights into the model’s interpretability, we applied the GradientSHAP method to identify key disease features along the CRC risk trajectory. We further validated the model on an independent cohort of 1 million family medicine patients (46.9% male, median age at recruitment 63.8 years [IQR: 54.8-77.5]), enrolled between January 1, 2008, and December 31, 2009, with follow-up until May 31, 2024. This cohort included 4,301 patients diagnosed with CRC (63.0% male, median age at diagnosis 66.6 years [IQR: 56.3-76.3]), alongside matched controls. Notably, none of these patients were included in the model training set. For the sensitivity analysis, we employed a bootstrapping statistical test with 1,000 iterations to assess metrics such as AUROC and AUPRC for significance.

Among the evaluated models (Transformer, RNN, and MLP, each with three input variants: diagnosis trajectory only; demographics and diagnosis trajectory; and demographics, diagnosis trajectory, and medication histories), the RNN demonstrated superior performance in predicting CRC when incorporating patient demographics, disease history, and medication data. The RNN achieved an area under the receiver operating characteristic curve (AUROC) of 0.887 (95% CI: 0.886-0.889) on the holdout validation cohort and 0.903 (95% CI: 0.882-0.902) on the independent validation cohort. This performance surpassed that of the Transformer and MLP models. The GradientSHAP analysis revealed that the RNN model’s predictions aligned well with clinical observations. Specifically, the model identified histories of “Crohn’s disease” (ICD-9: 555.x), “Ulcerative colitis” (ICD-9: 556.x), “Overweight and obesity” (ICD-9: 278), and “Diabetes” (ICD-9: 250.x) as the top predictors of CRC, which is consistent with established clinical risk factors.

In summary, this study demonstrates that language model-based analysis of EHR data offers a highly accurate method for the early detection of CRC in general patients. This approach holds great promise for practical application in primary care settings, potentially improving early diagnosis and patient survival rates.
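
The abstract reports AUROC values with 95% confidence intervals and mentions a bootstrap with 1,000 iterations, but does not spell out the procedure. As an illustration only, the following minimal sketch shows one common way such an interval can be obtained: a percentile bootstrap over patients. The percentile method, the function names `auroc` and `bootstrap_auroc_ci`, and the fixed random seed are assumptions made for this example and are not taken from the study.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Group equal scores; each group gets the mean of the ranks it spans.
    _, inverse, counts = np.unique(
        y_score, return_inverse=True, return_counts=True
    )
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def bootstrap_auroc_ci(y_true, y_score, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC (assumed method, for illustration)."""
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    n = y_true.size
    stats = []
    while len(stats) < n_iter:
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        labels = y_true[idx]
        if labels.min() == labels.max():
            continue  # skip resamples that contain only one class
        stats.append(auroc(labels, y_score[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(y_true, y_score), float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic labels and scores, generated only to exercise the functions.
    rng = np.random.default_rng(42)
    labels = rng.integers(0, 2, size=500)
    scores = labels * 0.8 + rng.normal(0.0, 1.0, size=500)
    point, lower, upper = bootstrap_auroc_ci(labels, scores)
    print(f"AUROC {point:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

The same resampling loop could be reused for AUPRC by swapping the metric function. Other interval constructions (for example, bias-corrected bootstrap or DeLong's method) would give somewhat different bounds.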