Scientific Reports (May 2023)
Application of supervised machine learning algorithms for classification and prediction of type-2 diabetes disease status in Afar regional state, Northeastern Ethiopia 2021
Abstract
Ethiopia faces a growing burden of diabetes in general and of type-2 diabetes in particular. Knowledge extracted from stored datasets can provide an important basis for better decisions on rapid diabetes diagnosis and can support prediction for early intervention. This study therefore addressed these problems by applying supervised machine learning algorithms to the classification and prediction of type-2 diabetes disease status, and it may provide context-specific information to program planners and policy makers so that priority is given to the more affected groups. The objective was to apply supervised machine learning algorithms, compare them, and select the best algorithm based on performance for classifying and predicting type-2 diabetes disease status (positive or negative) in public hospitals of Afar regional state, Northeastern Ethiopia. The study was conducted in Afar regional state from February to June 2021. Seven supervised machine learning algorithms were applied to secondary data obtained from a review of medical database records: decision tree (pruned J48), artificial neural network, k-nearest neighbor, support vector machine, binary logistic regression, random forest, and Naïve Bayes. A dataset of 2239 records of patients who underwent diabetes diagnosis from 2012 to 22 April 2020 (1523 with type-2 diabetes and 716 without type-2 diabetes) was checked for completeness prior to analysis. All algorithms were run in WEKA 3.7 and compared on correctly classified rate, kappa statistic, confusion matrix, area under the curve, sensitivity, and specificity.
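As an illustration of how several of the comparison metrics named above relate to one another, the following minimal Python sketch derives the correctly classified rate, Cohen's kappa, sensitivity, and specificity from a 2x2 confusion matrix. The study itself used WEKA; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study data.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, kappa, sensitivity, specificity) for a 2x2 confusion matrix.

    tp: actual positives predicted positive
    fn: actual positives predicted negative
    fp: actual negatives predicted positive
    tn: actual negatives predicted negative
    """
    actual_pos = tp + fn
    actual_neg = fp + tn
    if actual_pos == 0 or actual_neg == 0:
        raise ValueError("both classes must be present")
    n = actual_pos + actual_neg

    # Correctly classified rate (observed agreement).
    accuracy = (tp + tn) / n

    # Agreement expected by chance, from the row and column totals.
    expected = (actual_pos * (tp + fp) + actual_neg * (fn + tn)) / (n * n)

    # Cohen's kappa: agreement beyond chance, scaled to at most 1.
    # expected < 1 is guaranteed here because both classes are present.
    kappa = (accuracy - expected) / (1 - expected)

    sensitivity = tp / actual_pos   # true positive rate
    specificity = tn / actual_neg   # true negative rate
    return accuracy, kappa, sensitivity, specificity


# Hypothetical counts, for illustration only (not study data).
acc, kappa, sens, spec = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print(f"{acc:.2f} {kappa:.2f} {sens:.2f} {spec:.2f}")  # prints "0.85 0.70 0.80 0.90"
```

Kappa is reported alongside the correctly classified rate because it discounts agreement that would occur by chance, which matters when the classes are imbalanced, as they are in this dataset (1523 versus 716 records).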
Of the seven supervised machine learning algorithms, random forest gave the best classification and prediction results [correctly classified rate 93.8%, kappa statistic 0.85, sensitivity 0.98, area under the curve 0.97, and confusion matrix showing 446 of 454 actual positives predicted as positive]. It was followed by the pruned J48 decision tree [correctly classified rate 91.8%, kappa statistic 0.80, sensitivity 0.96, area under the curve 0.91, and confusion matrix showing 438 of 454 actual positives predicted as positive] and k-nearest neighbor [correctly classified rate 89.8%, kappa statistic 0.76, sensitivity 0.92, area under the curve 0.88, and confusion matrix showing 421 of 454 actual positives predicted as positive]. Random forest, pruned J48 decision tree, and k-nearest neighbor therefore performed best among the compared algorithms at classifying and predicting type-2 diabetes disease status. Based on this performance, the random forest algorithm can be considered a suggestive and supportive tool for clinicians when diagnosing type-2 diabetes.
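The sensitivity figures can be cross-checked against the confusion-matrix counts quoted in the abstract, since sensitivity is the number of correctly predicted positives divided by the number of actual positives. The short Python sketch below recomputes it for the three leading algorithms; the variable and function names are illustrative, and only the counts (446, 438, and 421 out of 454) come from the abstract.

```python
# Counts quoted in the abstract: correctly predicted positives
# out of 454 actual positives.
ACTUAL_POSITIVES = 454
correct_positives = {
    "random forest": 446,
    "decision tree (pruned J48)": 438,
    "k-nearest neighbor": 421,
}


def sensitivity(true_positives, actual_positives):
    """Sensitivity (recall): share of actual positives predicted as positive."""
    return true_positives / actual_positives


recomputed = {
    name: sensitivity(tp, ACTUAL_POSITIVES)
    for name, tp in correct_positives.items()
}

for name, value in recomputed.items():
    print(f"{name}: {value:.3f}")
# random forest: 0.982
# decision tree (pruned J48): 0.965
# k-nearest neighbor: 0.927
```

The recomputed values for random forest and the pruned J48 decision tree agree with the reported 0.98 and 0.96 at two decimal places. The k-nearest neighbor count gives about 0.927, close to the reported 0.92.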