IEEE Access (Jan 2024)
Comparative Analysis of Machine Learning Algorithms for CKD Risk Prediction
Abstract
Chronic Kidney Disease (CKD) remains a significant global health challenge, with increasing prevalence and a substantial impact on patient quality of life. Early and accurate prediction of CKD risk is crucial for timely intervention and management. This study presents a comparative analysis of machine learning and deep learning algorithms for predicting CKD risk. Eight traditional machine learning algorithms were applied: Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree, Random Forest (RF), Support Vector Machine (SVM), Logistic Regression, AdaBoost, and XGBoost, each implemented on a CKD dataset retrieved from the UCI data repository. In addition, three neural network-based algorithms, Artificial Neural Network (ANN), Simple Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), were evaluated for comparison with the traditional algorithms. The study assessed each algorithm’s performance in terms of accuracy, precision, recall, and F1 score, and also examined computational efficiency and applicability in real-world clinical settings. All eleven algorithms were trained on three versions of the dataset. The first version kept the original class imbalance and used KNN imputation to fill in missing values (unbalanced). The second version used SMOTENC to generate synthetic samples and balance the classes (balanced). The third version used feature selection to retain 14 of the original 24 features (feature selection). The results showed almost no performance difference among the classifiers trained on the unbalanced, balanced, and feature selection datasets. Consequently, the best algorithms for this task are those with short training and testing runtimes, namely RF, SVM, AdaBoost, and XGBoost. The experiments also showed that the neural network-based algorithms offered no performance advantage and were slower to train, given the small number of samples available in the original dataset.
Keywords