Predictive modelling and identification of key risk factors for stroke using machine learning

Ahmad Hassan; Saima Gulzar Ahmad; Ehsan Ullah Munir; Imtiaz Ali Khan; Naeem Ramzan

doi:10.1038/s41598-024-61665-4

Scientific Reports (May 2024)

Affiliations

Ahmad Hassan: Department of Computer Science, COMSATS University Islamabad
Saima Gulzar Ahmad: Department of Computer Science, COMSATS University Islamabad
Ehsan Ullah Munir: Department of Computer Science, COMSATS University Islamabad
Imtiaz Ali Khan: Department of Computer Science, Cardiff School of Technologies
Naeem Ramzan: School of Computing, Engineering and Physical Sciences, University of the West of Scotland

DOI: https://doi.org/10.1038/s41598-024-61665-4
Journal volume & issue: Vol. 14, no. 1
pp. 1 – 23

Abstract


Strokes are a leading global cause of mortality, underscoring the need for early detection and prevention strategies. However, identifying hidden risk factors and achieving accurate prediction are particularly challenging when data are imbalanced and contain missing values. This study applies three imputation techniques to handle missing data and uses the synthetic minority oversampling technique (SMOTE) to address class imbalance. It begins with a baseline model and then applies an extensive range of advanced models, evaluating their performance with k-fold cross-validation on various imbalanced and balanced datasets. The findings reveal that age, body mass index (BMI), average glucose level, heart disease, hypertension, and marital status are the most influential features in predicting strokes. Furthermore, a Dense Stacking Ensemble (DSE) model is built on the fine-tuned advanced models, with the best-performing model as the meta-classifier. The DSE model achieved over 96% accuracy across the datasets, with an AUC score of 83.94% on the imbalanced imputed dataset and 98.92% on the balanced one. These results indicate that the DSE model compares favourably with previous research on the same dataset and highlight its potential for early stroke detection to improve patient outcomes.
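
As an illustration of the oversampling step mentioned in the abstract, the following is a minimal, standard-library-only sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its k nearest minority neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy feature values are assumptions made for this example, and unlike the original algorithm, which generates a fixed number of synthetic points per minority sample, this sketch picks base samples at random.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature lists (at least 2 rows).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical values): columns are age and average glucose level.
stroke_cases = [[67.0, 228.7], [80.0, 105.9], [74.0, 170.2], [69.0, 94.4]]
new_cases = smote(stroke_cases, n_new=6, k=2, seed=42)
```

Because each synthetic point is a convex combination of two real minority samples, it always falls within the range spanned by the observed minority values. In practice, oversampling of this kind would normally be applied only to the training folds during k-fold cross-validation, so that synthetic points do not leak into the evaluation data.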

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
