Exploring Early Prediction of Chronic Kidney Disease Using Machine Learning Algorithms for Small and Imbalanced Datasets

Andressa C. M. da Silveira; Álvaro Sobrinho; Leandro Dias da Silva; Evandro de Barros Costa; Maria Eliete Pinheiro; Angelo Perkusich

doi:10.3390/app12073673

Applied Sciences (Apr 2022)

Affiliations

Andressa C. M. da Silveira: Electrical Engineering Department, Federal University of Campina Grande, Campina Grande 58428-830, Brazil
Álvaro Sobrinho: Computer Science, Federal University of the Agreste of Pernambuco, Garanhuns 55292-270, Brazil
Leandro Dias da Silva: Computing Institute, Federal University of Alagoas, Maceió 57072-900, Brazil
Evandro de Barros Costa: Faculty of Medicine, Federal University of Alagoas, Maceió 57072-900, Brazil
Maria Eliete Pinheiro: Faculty of Medicine, Federal University of Alagoas, Maceió 57072-900, Brazil
Angelo Perkusich: Virtus Research, Development and Innovation Center, Federal University of Campina Grande, Campina Grande 58428-830, Brazil

DOI: https://doi.org/10.3390/app12073673
Journal volume & issue: Vol. 12, no. 7, p. 3673

Abstract

Chronic kidney disease (CKD) is a worldwide public health problem that is usually diagnosed in its late stages. To alleviate this issue, investment in early prediction is necessary. The purpose of this study is to assist the early prediction of CKD, addressing problems related to imbalanced and limited-size datasets. We used data from the medical records of Brazilians with or without a diagnosis of CKD, containing the following attributes: hypertension, diabetes mellitus, creatinine, urea, albuminuria, age, gender, and glomerular filtration rate. We present an oversampling approach based on manual and automated augmentation. We experimented with the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, and Borderline-SMOTE SVM. We implemented models based on the following algorithms: decision tree (DT), random forest, and multi-class AdaBoosted DTs. We also applied the overall local accuracy and local class accuracy methods for dynamic classifier selection, and the k-nearest oracles-union, k-nearest oracles-eliminate, and META-DES methods for dynamic ensemble selection. We analyzed the models’ performance using hold-out validation, multiple stratified cross-validation (CV), and nested CV. The DT model presented the highest accuracy score (98.99%) when combined with manual augmentation and SMOTE. Our approach can assist in designing systems for the early prediction of CKD using imbalanced and limited-size datasets.
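
To illustrate the oversampling idea named in the abstract, the following is a minimal, simplified sketch of the core SMOTE step: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy two-feature data (loosely labelled as creatinine and urea) are assumptions made up for illustration, and the sketch omits the manual augmentation, the Borderline variants, and the classifiers described above.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (creatinine, urea) pairs; all values are made up.
majority = [(0.8, 25.0), (0.9, 30.0), (1.0, 28.0),
            (0.7, 22.0), (1.1, 35.0), (0.9, 27.0)]
minority = [(2.1, 60.0), (2.5, 72.0), (3.0, 85.0), (2.8, 78.0)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region the minority class already occupies. In practice, a maintained library implementation would normally be preferred over a hand-written one, and oversampling should be applied only to training folds so that synthetic samples do not leak into validation data.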

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
