JOIN: Jurnal Online Informatika (Apr 2024)

Analysis of Data and Feature Processing on Stroke Prediction using Wide Range Machine Learning Model

  • Untari Novia Wisesty,
  • Tjokorda Agung Budi Wirayuda,
  • Febryanti Sthevanie,
  • Rita Rismala

DOI
https://doi.org/10.15575/join.v9i1.1249
Journal volume & issue
Vol. 9, no. 1
pp. 29 – 40

Abstract

Stroke is a disease that causes the death of brain cells, so that the part of the body controlled by those cells loses its function. If not treated immediately, it can cause long-term disability, brain damage, and death. In this research, stroke prediction was carried out on the Stroke dataset acquired from Kaggle using various machine learning models. Data sampling techniques, namely Random Undersampling (RUS), Random Oversampling (ROS), and SMOTE, were used to handle the class imbalance in the dataset. Pearson Correlation and Principal Component Analysis were also used for dimensionality reduction and for analyzing the features that are most influential in predicting stroke. Pearson Correlation identified the five attributes with the highest Pearson coefficients: age, hypertension, heart disease, blood sugar level, and marital status. Experimental results show that the RUS, ROS, and SMOTE sampling techniques increase the testing F1-Score by 43.44%, 34.44%, and 35.55% respectively, compared with experiments conducted without any data sampling. The highest testing F1-Score, 0.83, was achieved by the Support Vector Machine and Gaussian Naïve Bayes models.
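The random sampling and Pearson correlation steps named in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: it uses the Python standard library, the helper names (random_undersample, random_oversample, pearson) are hypothetical, and the toy rows are invented for the example and are not taken from the Kaggle dataset. SMOTE and Principal Component Analysis are omitted for brevity.

```python
import math
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented toy data: each row is [age, hypertension]; label 1 means stroke.
rows = [[34, 0], [41, 0], [29, 0], [52, 0], [47, 1],
        [38, 0], [60, 0], [45, 0], [71, 1], [79, 1]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

rus_rows, rus_labels = random_undersample(rows, labels)
ros_rows, ros_labels = random_oversample(rows, labels)

print(Counter(labels))      # original, imbalanced class counts
print(Counter(rus_labels))  # both classes reduced to the minority size
print(Counter(ros_labels))  # both classes raised to the majority size

# Correlation between the age column and the label on the toy data.
ages = [row[0] for row in rows]
print(round(pearson(ages, labels), 3))
```

In practice, resampling of this kind is usually applied to the training split only, so that the test set keeps the original class distribution.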

Keywords