Computer Methods and Programs in Biomedicine Update (Jan 2021)

Prediction of diabetes using logistic regression and ensemble techniques

  • Priyanka Rajendra,
  • Shahram Latifi

Journal volume & issue
Vol. 1
p. 100032

Abstract

Background: Logistic regression is a machine learning classification model used extensively in clinical analysis. It produces probabilistic estimates that help in understanding the relationship between a dependent variable and one or more independent variables. Diabetes is one of the most common diseases worldwide, and detecting it early may prevent its progression and avoid other complications. In this work, we design a model that predicts whether a patient has diabetes from the diagnostic measurements included in the dataset, and we explore various techniques to improve its performance and accuracy.

Methods: Logistic regression is the main algorithm used in this paper, and the analysis is carried out in a Python IDE. The experiment mainly uses two datasets: the PIMA Indians Diabetes dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and a dataset from Vanderbilt, based on a study of rural African Americans in Virginia. Feature selection is carried out using two different methods. Ensemble methods are then applied, which improve performance by producing better predictions than a single model.

Results: Accuracy and runtimes are recorded for the original datasets and for the datasets obtained after applying feature selection and ensemble techniques, and the cases are compared. The highest accuracy obtained was around 78% for Dataset 1, after applying the Max Voting ensemble technique, and around 93% for Dataset 2, after applying the Max Voting and Stacking ensemble techniques.

Conclusion: Logistic regression has been shown to be an efficient algorithm for building prediction models. This study also shows that, apart from the choice of algorithm, other factors can improve the accuracy and runtime of the model, such as data preprocessing, removal of redundant and null values, normalization, cross-validation, feature selection, and the use of ensemble techniques.
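
As a rough illustration of two ideas named in the abstract, the sketch below shows how a logistic regression model turns a weighted sum of features into a probability, and how Max Voting combines the labels predicted by several models. It is not the authors' implementation: the function names, the 0.5 threshold, and the example predictions are all hypothetical, and the abstract does not say which models were combined.

```python
import math
from collections import Counter


def predict_proba(weights, bias, features):
    """Logistic-regression probability that a sample is positive.

    weights, bias: hypothetical fitted parameters (not taken from the paper).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 label; the threshold is an assumption."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


def max_vote(predictions):
    """Max Voting: return the most common label for each sample.

    predictions: one list of labels per model, all of the same length.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) yields the label with the highest vote count
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical labels from three models for four patients
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(max_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1]
```

Using an odd number of binary classifiers, as here, avoids tied votes. Stacking differs in that the base models' predictions are passed as inputs to a further model instead of being counted.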

Keywords