Clinical Epidemiology (Jun 2021)

Cardiovascular Disease Prediction by Machine Learning Algorithms Based on Cytokines in Kazakhs of China

  • Jiang Y,
  • Zhang X,
  • Ma R,
  • Wang X,
  • Liu J,
  • Keerman M,
  • Yan Y,
  • Ma J,
  • Song Y,
  • Zhang J,
  • He J,
  • Guo S,
  • Guo H

Journal volume & issue
Vol. 13
pp. 417 – 428



Yunxing Jiang,1,* Xianghui Zhang,1,* Rulin Ma,1 Xinping Wang,1 Jiaming Liu,1 Mulatibieke Keerman,1 Yizhong Yan,1 Jiaolong Ma,1 Yanpeng Song,1,2 Jingyu Zhang,1 Jia He,1 Shuxia Guo,1,3 Heng Guo1

1Department of Public Health, Shihezi University School of Medicine, Shihezi, Xinjiang, People’s Republic of China; 2The First Affiliated Hospital of Shihezi University Medical College, Shihezi, Xinjiang, People’s Republic of China; 3Department of Pathology and Key Laboratory of Xinjiang Endemic and Ethnic Diseases (Ministry of Education), Shihezi University School of Medicine, Shihezi, Xinjiang, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Shuxia Guo; Heng Guo
Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, Xinjiang, People’s Republic of China
Tel +8618009932625
Fax +8609932057153
Email [email protected]; [email protected]

Cardiovascular disease (CVD) is the leading cause of mortality worldwide. Accurately identifying subjects at high risk of CVD may improve CVD outcomes. We sought to systematically examine the feasibility and performance of 7 widely used machine learning (ML) algorithms in predicting CVD risk.

Methods: The final analysis included 1508 Kazakh subjects in China who were free of CVD at baseline and completed follow-up. All subjects were randomly divided into a training set (80%) and a test set (20%). L1-penalized logistic regression (LR), support vector machine with radial basis function (SVM), decision tree (DT), random forest (RF), k-nearest neighbors (KNN), Gaussian naive Bayes (NB), and extreme gradient boosting (XGB) were employed to predict CVD outcomes. Ten-fold cross-validation was used during model development and hyperparameter tuning in the training set. Model performance was evaluated in the test set in terms of discrimination, calibration, and clinical usefulness. RF was applied to obtain the importance of the included variables. Twenty-two variables, including sociodemographic characteristics, medical history, cytokines, and synthetic indices, were used for model development.

Results: Among 1508 subjects, 203 were diagnosed with CVD over a median follow-up of 5.17 years. All 7 models had moderate to excellent discrimination (AUC ranged from 0.770 to 0.872) and were well calibrated. LR and SVM performed comparably, with AUCs of 0.872 (95% CI: 0.829–0.907) and 0.868 (95% CI: 0.825–0.904), respectively. LR had the lowest Brier score (0.078) and the highest sensitivity (97.1%). Decision curve analysis indicated that SVM was slightly better than LR. The inflammatory cytokines, such as hs-CRP and IL-6, were identified as strong predictors of CVD.

Conclusion: SVM and LR can be applied to guide clinical decision-making in the Kazakh Chinese population, and further study is required to confirm their accuracy.

Keywords: cardiovascular disease, prediction model, machine learning, Kazakh population
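
The three evaluation axes named in the Methods (discrimination, calibration, and clinical usefulness) are commonly summarized by the AUC, the Brier score, and the net benefit used in decision curve analysis. The following is a minimal pure-Python sketch of those standard definitions. It is not the authors' code: the function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome.

    Lower is better; 0 means perfect probabilistic predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random case is ranked above a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit at a risk threshold pt: TP/n - (FP/n) * pt / (1 - pt).

    Subjects with predicted risk >= threshold are treated as high risk.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy outcomes and predicted risks, invented for illustration only
# (not data from the study).
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.3, 0.8, 0.6, 0.2, 0.4, 0.5, 0.05]

if __name__ == "__main__":
    print("Brier score:", brier_score(y_true, y_prob))
    print("AUC:", auc(y_true, y_prob))
    print("Net benefit at pt=0.35:", net_benefit(y_true, y_prob, 0.35))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing each model against the "treat all" and "treat none" strategies. Two models with nearly equal AUC can still differ in net benefit at clinically relevant thresholds.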