Scientific African (Sep 2023)

Analysis of COVID-19 cases and comorbidities using machine learning algorithms: A case study of the Limpopo Province, South Africa

  • Alexander Boateng,
  • Daniel Maposa,
  • Reshoketswe Mokobane,
  • Timotheus Darikwa,
  • Charles Gyamfi

Journal volume & issue
Vol. 21
p. e01840

Abstract

This study examined the biological, social, and clinical risk factors for mortality in patients hospitalised with coronavirus disease 2019 (COVID-19). The study population is prone to COVID-19, so understanding the most common traits and comorbidities of those affected is crucial to reducing its consequences. Four supervised machine learning algorithms were implemented and compared to predict mortality from the explanatory variables across the five districts of Limpopo Province in South Africa. The data were obtained from the Limpopo Department of Health. The chance of dying from COVID-19 was predicted using logistic regression, random forest, support vector machine, and decision tree algorithms on a dataset of 20,592 records with twenty-one attributes. Because the data were imbalanced, Random Over-Sampling Examples (ROSE) was employed to balance the classes for more accurate classification. The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. We used 70% of the data for training and the remaining 30% for testing the predictive algorithms. Stepwise selection by Akaike's Information Criterion (StepAIC) was used to remove insignificant variables from the full logistic regression model. Among the four algorithms tested, random forest had the highest recall for predicting mortality, at roughly 79 percent. Accordingly, we conclude that the random forest algorithm is appropriate for predicting the chances of patients dying from COVID-19 based on the attributes of the five districts of Limpopo Province. To assess feature importance, a function called Variable Importance (VarImp) was used to check which attributes have predictive power for the outcome variable (discharge status).
The findings revealed that age, ever ventilated, ever oxygenated, intensive ward upon admission, Waterberg district, and private facility type were among the risk factors that could be selected for the logistic regression model to predict mortality in hospitalised patients. This implies that special attention should be given to these variables. The random forest model is adequate for establishing these factors, since it correctly classified roughly 79% of the cases. The novelty of applying machine learning algorithms to the analysis of COVID-19 cases and comorbidities lies in their potential to uncover insights and patterns that might not be immediately apparent through traditional statistical methods. Machine learning algorithms can quickly and accurately analyse large datasets and identify complex relationships between variables, which could provide valuable information for public health officials and policymakers.
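The data-preparation and evaluation steps named in the abstract (class balancing, a 70/30 split, and recall as the comparison metric) can be illustrated with a minimal Python sketch. It is not the authors' code: the study used the ROSE package, which generates synthetic examples, whereas this sketch substitutes plain random oversampling with replacement as a simpler stand-in. The abstract does not say whether balancing was done before or after the split, so balancing only the training portion is an assumption here. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and test portions."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def random_oversample(rows, label_of, seed=0):
    """Resample smaller classes with replacement until every class
    matches the largest one (a stand-in for ROSE, not ROSE itself)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


def recall(y_true, y_pred, positive=1):
    """Share of actual positives predicted as positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: (age, died) pairs with 10% positives; not the study data.
toy = [(20 + i % 60, 1 if i % 10 == 0 else 0) for i in range(1000)]
train, test = train_test_split(toy)
balanced_train = random_oversample(train, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced_train)
```

Recall is the natural metric to compare here because, with mortality as the positive class, it measures how many of the patients who actually died were flagged by the model, which plain accuracy can hide when deaths are rare.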

Keywords