Scientific Reports (Jan 2024)

Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data

  • Moa Lugner,
  • Araz Rawshani,
  • Edvin Helleryd,
  • Björn Eliasson

DOI
https://doi.org/10.1038/s41598-024-52023-5
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 9

Abstract

Read online

Abstract The study aimed to identify the most predictive factors for the development of type 2 diabetes. Using an XGboost classification model, we projected type 2 diabetes incidence over a 10-year horizon. We deliberately minimized the selection of baseline factors to fully exploit the rich dataset from the UK Biobank. The predictive value of features was assessed using shap values, with model performance evaluated via Receiver Operating Characteristic Area Under the Curve, sensitivity, and specificity. Data from the UK Biobank, encompassing a vast population with comprehensive demographic and health data, was employed. The study enrolled 450,000 participants aged 40–69, excluding those with pre-existing diabetes. Among 448,277 participants, 12,148 developed type 2 diabetes within a decade. HbA1c emerged as the foremost predictor, followed by BMI, waist circumference, blood glucose, family history of diabetes, gamma-glutamyl transferase, waist-hip ratio, HDL cholesterol, age, and urate. Our XGboost model achieved a Receiver Operating Characteristic Area Under the Curve of 0.9 for 10-year type 2 diabetes prediction, with a reduced 10-feature model achieving 0.88. Easily measurable biological factors surpassed traditional risk factors like diet, physical activity, and socioeconomic status in predicting type 2 diabetes. Furthermore, high prediction accuracy could be maintained using just the top 10 biological factors, with additional ones offering marginal improvements. These findings underscore the significance of biological markers in type 2 diabetes prediction.