Informatics in Medicine Unlocked (Jan 2019)

Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques

  • Changsheng Zhu,
  • Christian Uwa Idemudia,
  • Wenfang Feng

Journal volume & issue
Vol. 17

Abstract

Diabetes causes a large number of deaths each year, and many people living with the disease do not learn of their condition early enough. In this study, we propose a data-mining-based model for early diagnosis and prediction of diabetes using the Pima Indians Diabetes dataset. Although k-means is simple and can be used for a wide variety of data types, it is quite sensitive to the initial positions of the cluster centers, which determine the final clustering result. That result either provides a sufficient and well-clustered dataset for the logistic regression model, or yields less data because parts of the original dataset were clustered incorrectly, thereby limiting the performance of the logistic regression model. Our main goal was to find ways of improving the accuracy of both the k-means clustering and the logistic regression. Our model comprises PCA (principal component analysis), k-means, and a logistic regression algorithm. Experimental results show that PCA improved the accuracy of both the k-means clustering algorithm and the logistic regression classifier relative to the results of other published studies, with k-means correctly classifying 25 more data points and a logistic regression accuracy 1.98% higher. As such, the model is shown to be useful for automatically predicting diabetes from patient electronic health record data. A further experiment with a new dataset showed the applicability of our model to the prediction of diabetes.

Keywords: PCA, K-means, Diabetes, Data mining, Logistic regression
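
The three stages named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic Gaussian blobs rather than the Pima Indians Diabetes dataset, the number of components and all hyperparameters are arbitrary, and every function name is hypothetical. The step that discards samples whose cluster disagrees with their class label is one reading of the abstract's remark that incorrect clustering leaves less data for the logistic regression model.

```python
import numpy as np

def make_toy_data(n_per_class=150, n_features=8, seed=0):
    """Synthetic stand-in for a binary-labelled dataset (assumption, not Pima data)."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    x1 = rng.normal(1.5, 1.0, size=(n_per_class, n_features))
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y

def pca_project(X, n_components=2):
    """Standardize the features, then project onto the top principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[:n_components].T

def kmeans(X, k=2, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; the outcome depends on the random initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def keep_consistent(X, y, labels):
    """Keep only samples whose cluster agrees with their class label (k = 2)."""
    if np.mean(labels == y) < 0.5:  # cluster ids are arbitrary, so flip if needed
        labels = 1 - labels
    mask = labels == y
    return X[mask], y[mask]

def add_bias(X):
    """Append a constant column so the model learns an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression fitted by batch gradient descent on the log-loss."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    """Predict class 1 where the linear score is positive."""
    return (add_bias(X) @ w > 0).astype(int)

X, y = make_toy_data()
X_pca = pca_project(X, n_components=2)               # stage 1: PCA
labels = kmeans(X_pca, k=2)                          # stage 2: k-means
X_kept, y_kept = keep_consistent(X_pca, y, labels)   # drop inconsistent samples

# Stage 3: logistic regression on the retained samples, 80/20 split.
idx = np.random.default_rng(1).permutation(len(X_kept))
split = int(0.8 * len(idx))
train, test = idx[:split], idx[split:]
w = fit_logistic(X_kept[train], y_kept[train])
accuracy = np.mean(predict(w, X_kept[test]) == y_kept[test])
print(f"kept {len(X_kept)} of {len(X)} samples, held-out accuracy {accuracy:.3f}")
```

Changing the `seed` passed to `kmeans` is a simple way to observe the sensitivity to initial cluster centers that the abstract describes. The accuracy printed here is measured on the retained synthetic samples only, so it is not comparable to the figures reported in the paper.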