IEEE Access (Jan 2020)

Machine Learning-Based Application for Predicting Risk of Type 2 Diabetes Mellitus (T2DM) in Saudi Arabia: A Retrospective Cross-Sectional Study

  • Asif Hassan Syed,
  • Tabrej Khan

DOI
https://doi.org/10.1109/ACCESS.2020.3035026
Journal volume & issue
Vol. 8
pp. 199539 – 199561

Abstract

Early detection of individuals at the highest risk of developing diabetes is crucial to curbing the disease's prevalence and progression. Therefore, we aim to build a data-driven predictive application for screening subjects at a high risk of developing Type 2 Diabetes Mellitus (T2DM) in the western region of Saudi Arabia. In this context, we designed and implemented a questionnaire-based cross-sectional study using conventional diabetes risk factors to study the prevalence of T2DM and the association between the outcomes and the exposure(s). We used the Chi-squared test and binary logistic regression to analyze and screen the most significant diabetes risk factors for T2DM risk prediction. The Synthetic Minority Over-sampling Technique (SMOTE), a class balancer, was used to balance the cross-sectional data. We used the class-balanced data to identify the classification algorithm that classifies patients at high risk of diabetes with the highest F1 score. The hyper-parameters of the best-performing classifier were further tuned using 10-fold cross-validation to improve the F1 score. Additionally, we validated our proposed model against existing models built using the National Health and Nutrition Examination Survey (NHANES) dataset and the Pima Indian Diabetes (PID) dataset. The results of the Chi-squared test and binary logistic regression showed that the exposures, namely Smoking, Healthy diet, Blood Pressure (BP), Body Mass Index (BMI), Gender, and Region, contributed significantly (p < 0.05) to the prediction of the response variable (subjects at high risk of diabetes). The tuned two-class Decision Forest (DF) model showed the best performance, with an average F1 score of 0.8453 ± 0.0268. Moreover, the DF-based model adapted reasonably well to other diabetes datasets.
An Application Programming Interface (API) of the tuned DF model was implemented and deployed as a web service at https://type2-diabetes-risk-predictor.herokuapp.com, and the implementation codes are available at https://github.com/SAH-ML/T2DM-Risk-Predictor.
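
For readers unfamiliar with two of the techniques named in the abstract, the following is a minimal Python sketch of SMOTE-style oversampling and of the F1 score. It is an illustration under stated assumptions, not the authors' implementation: the function names smote and f1_score, the parameters k and seed, and the toy data are all hypothetical, and every feature is assumed to be numeric so that interpolating between two samples is meaningful.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (illustrative sketch only).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Assumes all features are numeric.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data: three high-risk (minority) samples with two features.
    high_risk = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(smote(high_risk, n_new=4, k=2, seed=1))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In practice, oversampling is normally applied only to the training portion of each cross-validation fold, so that synthetic samples do not leak into the data used to compute the F1 score.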

Keywords