Journal of International Medical Research (Jun 2024)

Comparing the accuracy of four machine learning models in predicting type 2 diabetes onset within the Chinese population: a retrospective study

  • Hongzhou Liu,
  • Song Dong,
  • Hua Yang,
  • Linlin Wang,
  • Jia Liu,
  • Yangfan Du,
  • Jing Liu,
  • Zhaohui Lyu,
  • Yuhan Wang,
  • Li Jiang,
  • Shasha Yu,
  • Xiaomin Fu

DOI
https://doi.org/10.1177/03000605241253786
Journal volume & issue
Vol. 52

Abstract

Objective: To evaluate the effectiveness of machine learning (ML) models in predicting 5-year type 2 diabetes mellitus (T2DM) risk within the Chinese population by retrospectively analyzing annual health checkup records.

Methods: We included 46,247 patients (32,372 in the training set and 13,875 in the validation set) from a national health checkup center database. Univariate and multivariate Cox analyses were performed to identify factors influencing T2DM risk. Extreme Gradient Boosting (XGBoost), support vector machine (SVM), logistic regression (LR), and random forest (RF) models were trained to predict 5-year T2DM risk. Model performance was assessed using receiver operating characteristic (ROC) curves for discrimination and calibration plots for prediction accuracy.

Results: Key variables included fasting plasma glucose, age, and sedentary time. The LR model showed good accuracy, with areas under the ROC curve (AUCs) of 0.914 and 0.913 in the training and validation sets, respectively; the RF model had AUCs of 0.998 and 0.838 in the training and validation sets, respectively. In calibration analysis, the LR model displayed good fit for low-risk patients; the RF model exhibited satisfactory fit for low- and high-risk patients.

Conclusions: LR and RF models can effectively predict T2DM risk in the Chinese population. These models may help identify high-risk patients and guide interventions to prevent complications and disabilities.
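As a rough illustration of the discrimination metric reported above, the sketch below computes an AUC from labels and predicted risks using the pairwise-ranking definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case), then tabulates the difference between the training and validation AUCs quoted in the abstract. The function names `roc_auc` and `auc_gap` and the four-patient toy data are hypothetical and are not taken from the study; only the AUC values in `reported` come from the abstract.

```python
def roc_auc(labels, scores):
    """AUC via pairwise ranking: fraction of (positive, negative) pairs
    in which the positive case has the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_gap(train_auc, valid_auc):
    """Training AUC minus validation AUC, rounded to three decimals."""
    return round(train_auc - valid_auc, 3)


# Toy data (hypothetical): 1 = developed T2DM within 5 years, 0 = did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)

# AUCs as reported in the abstract: (training set, validation set).
reported = {"LR": (0.914, 0.913), "RF": (0.998, 0.838)}
gaps = {name: auc_gap(train, valid) for name, (train, valid) in reported.items()}

print(f"toy AUC: {toy_auc}")
for name, gap in gaps.items():
    print(f"{name}: training minus validation AUC = {gap}")
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example; on a cohort of the size described here, a rank-based or library implementation would be the more practical choice. The gap between training and validation AUC is shown only as one simple way to compare how each model's discrimination carries over to held-out data.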