IEEE Access (Jan 2021)
Memory-Efficient, Accurate and Early Diagnosis of Diabetes Through a Machine Learning Pipeline Employing Crow Search-Based Feature Engineering and a Stacking Ensemble
Abstract
Early diagnosis of diabetes helps avoid the major risks associated with the disorder. This research designs a machine learning pipeline that generates the smallest, most representative feature subset able to predict the onset of diabetes with the highest accuracy. It uses a novel diabetes dataset that, unlike the well-known Pima Indians Diabetes (PID) dataset, is gender-neutral and more representative. Several feature engineering pipelines are evaluated; each produces a reduced feature subset that is fed into multiple heterogeneous classifiers. Feature engineering covers both feature selection and feature extraction: selection uses the ANOVA filter and the Crow Search Optimization algorithm, while extraction uses Singular Value Decomposition. Classification is performed on the preprocessed dataset with a wide range of heterogeneous base learners (Naive Bayes, Logistic Regression, K-Nearest Neighbor, Decision Trees, Support Vector Machine, Random Forest, AdaBoost, and Gradient Boosting), followed by their stacking ensemble. Each machine learning pipeline is evaluated through Repeated Stratified K-fold Cross Validation using accuracy, precision, recall, F1 score, and area under the Receiver Operating Characteristic curve. The number of features in the preprocessed dataset varies from pipeline to pipeline, and the highest accuracy, 98.4%, is achieved with the Crow Search algorithm and a stacking ensemble of multiple heterogeneous classifiers. A comparison with a recent related work on the same dataset shows that the proposed feature engineering pipelines, paired with the same set of classifiers, achieve higher accuracy with a smaller feature set.
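The Crow Search feature selection step can be illustrated with a minimal sketch. The abstract does not specify the variant used, so everything below is an assumption made for illustration: positions are continuous values in [0, 1] thresholded at 0.5 to form a feature mask, out-of-range moves are clamped, and the parameter names and defaults (`awareness_prob`, `flight_length`, `n_crows`, `n_iter`) are hypothetical. The `toy_fitness` function is a stand-in; in a pipeline like the one described, the fitness would more plausibly be a cross-validated classifier score on the selected features.

```python
import random
from typing import Callable, Sequence, Tuple

Mask = Tuple[bool, ...]


def to_mask(position: Sequence[float]) -> Mask:
    """Threshold a continuous position in [0, 1]^d into a feature mask (assumed encoding)."""
    return tuple(v > 0.5 for v in position)


def crow_search_select(
    n_features: int,
    fitness: Callable[[Mask], float],
    n_crows: int = 10,
    n_iter: int = 50,
    awareness_prob: float = 0.1,   # hypothetical default
    flight_length: float = 2.0,    # hypothetical default
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Return the best feature mask found and its fitness (higher is better)."""
    rng = random.Random(seed)
    # Each crow starts at a random position; its memory holds its best position so far.
    positions = [[rng.random() for _ in range(n_features)] for _ in range(n_crows)]
    memory = [p[:] for p in positions]
    memory_fit = [fitness(to_mask(p)) for p in memory]

    for _ in range(n_iter):
        for i in range(n_crows):
            j = rng.randrange(n_crows)  # crow i follows a randomly chosen crow j
            if rng.random() >= awareness_prob:
                # Crow j is unaware: move toward (or past) j's remembered position.
                r = rng.random()
                new = [x + r * flight_length * (m - x)
                       for x, m in zip(positions[i], memory[j])]
                new = [min(1.0, max(0.0, v)) for v in new]  # clamp to [0, 1] (assumption)
            else:
                # Crow j is aware: crow i is sent to a random position.
                new = [rng.random() for _ in range(n_features)]
            positions[i] = new
            fit = fitness(to_mask(new))
            if fit > memory_fit[i]:  # update memory only on improvement
                memory[i], memory_fit[i] = new[:], fit

    best = max(range(n_crows), key=memory_fit.__getitem__)
    return to_mask(memory[best]), memory_fit[best]


# Toy stand-in fitness: reward three "informative" features, penalize subset size.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask: Mask) -> float:
    hits = sum(1 for k in INFORMATIVE if mask[k])
    return hits - 0.1 * sum(mask)


best_mask, best_score = crow_search_select(8, toy_fitness)
selected = [k for k, keep in enumerate(best_mask) if keep]
print(selected, round(best_score, 2))
```

The size penalty in `toy_fitness` mirrors the stated goal of a minimal feature subset; how the original work balances accuracy against subset size is not stated in the abstract.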
Keywords