Frontiers in Endocrinology (Feb 2024)

Predictive model and risk analysis for peripheral vascular disease in type 2 diabetes mellitus patients using machine learning and shapley additive explanation

  • Lianhua Liu,
  • Bo Bi,
  • Li Cao,
  • Mei Gui,
  • Feng Ju

DOI
https://doi.org/10.3389/fendo.2024.1320335
Journal volume & issue
Vol. 15

Abstract

Read online

BackgroundPeripheral vascular disease (PVD) is a common complication in patients with type 2 diabetes mellitus (T2DM). Early detection or prediction the risk of developing PVD is important for clinical decision-making.PurposeThis study aims to establish and validate PVD risk prediction models and perform risk factor analysis for PVD in patients with T2DM using machine learning and Shapley Additive Explanation(SHAP) based on electronic health records.MethodsWe retrospectively analyzed the data from 4,372 inpatients with diabetes in a hospital between January 1, 2021, and March 28, 2023. The data comprised demographic characteristics, discharge diagnoses and biochemical index test results. After data preprocessing and feature selection using Recursive Feature Elimination(RFE), the dataset was split into training and testing sets at a ratio of 8:2, with the Synthetic Minority Over-sampling Technique(SMOTE) employed to balance the training set. Six machine learning(ML) algorithms, including decision tree (DT), logistic regression (LR), random forest (RF), support vector machine(SVM),extreme gradient boosting (XGBoost) and Adaptive Boosting(AdaBoost) were applied to construct PVD prediction models. A grid search with 10-fold cross-validation was conducted to optimize the hyperparameters. Metrics such as accuracy, precision, recall, F1-score, G-mean, and the area under the receiver operating characteristic curve (AUC) assessed the models’ effectiveness. The SHAP method interpreted the best-performing model.ResultsRFE identified the optimal 12 predictors. The XGBoost model outperformed other five ML models, with an AUC of 0.945, G-mean of 0.843, accuracy of 0.890, precision of 0.930, recall of 0.927, and F1-score of 0.928. The feature importance of ML models and SHAP results indicated that Hemoglobin (Hb), age, total bile acids (TBA) and lipoprotein(a)(LP-a) are the top four important risk factors for PVD in T2DM.ConclusionThe machine learning approach successfully developed a PVD risk prediction model with good performance. The model identified the factors associated with PVD and offered physicians an intuitive understanding on the impact of key features in the model.

Keywords