BMJ Open (May 2023)
Tailored machine learning for evaluating the long-term diabetes risk in older individuals: findings from the Irish Longitudinal Study on Ageing (TILDA)
Abstract
Objectives The prevalence of diabetes has increased globally, leading to a significant disease burden and financial cost. Early prediction is crucial to controlling its prevalence.
Design A prospective cohort study.
Setting A nationally representative study of the Irish population.
Participants 8504 individuals aged 50 years or older were included.
Primary and secondary outcome measures Surveys were conducted to collect over 40 000 variables related to social, financial, health, mental and family status. Feature selection was performed using logistic regression. Several machine learning and deep learning algorithms were trained, including a distributed random forest, extremely randomised trees, a generalised linear model with regularisation, a gradient boosting machine and a deep neural network. These algorithms were integrated into a stacked ensemble to generate the best model. The model was evaluated using several metrics: the area under the curve (AUC), log loss, mean per-class error, mean square error (MSE) and root MSE (RMSE). The SHapley Additive exPlanations (SHAP) method was used to interpret the established model.
Results After 2 years, 105 baseline features were identified as major contributors to diabetes risk, including sex, low-density lipoprotein cholesterol and cirrhosis. The best model achieved high accuracy, robustness and discrimination in predicting diabetes risk, with an AUC of 0.854, log loss of 0.187, mean per-class error of 0.267, RMSE of 0.229 and MSE of 0.052 in the independent test set. The model was also shown to be well calibrated. The SHAP algorithm provided insights into the decision-making process of the model.
Conclusions These findings could help physicians identify high-risk patients early and implement targeted interventions to reduce diabetes incidence.
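The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the 0.5 classification threshold and the toy labels and probabilities are assumptions made purely for illustration, and the mean per-class error is taken here to be the unweighted average of the two per-class error rates.

```python
import math

def auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def mse(y_true, y_prob):
    """Mean squared difference between the label and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def rmse(y_true, y_prob):
    """Square root of the MSE."""
    return math.sqrt(mse(y_true, y_prob))

def mean_per_class_error(y_true, y_prob, threshold=0.5):
    """Unweighted average of the error rates of class 0 and class 1 (threshold is an assumption)."""
    rates = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        wrong = sum(1 for i in idx if (y_prob[i] >= threshold) != (cls == 1))
        rates.append(wrong / len(idx))
    return sum(rates) / len(rates)

# Hypothetical toy data: 1 = developed diabetes, 0 = did not.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.20, 0.60, 0.80, 0.40, 0.05, 0.90, 0.30]

for name, fn in [("AUC", auc), ("log loss", log_loss),
                 ("mean per-class error", mean_per_class_error),
                 ("MSE", mse), ("RMSE", rmse)]:
    print(f"{name}: {fn(y_true, y_prob):.3f}")
```

Because RMSE is the square root of MSE, the two reported test-set values can be checked against each other: 0.229 squared is about 0.0524, which rounds to the reported MSE of 0.052.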