Algorithms (Apr 2024)
Strategic Machine Learning Optimization for Cardiovascular Disease Prediction and High-Risk Patient Identification
Abstract
Despite medical advancements in recent years, cardiovascular diseases (CVDs) remain a major factor in rising mortality rates, challenging predictions despite extensive expertise. The healthcare sector is poised to benefit significantly from harnessing massive data and the insights we can derive from it, underscoring the importance of integrating machine learning (ML) to improve CVD prevention strategies. In this study, we addressed the major issue of class imbalance in the Behavioral Risk Factor Surveillance System (BRFSS) 2021 heart disease dataset, including personal lifestyle factors, by exploring several resampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-Edited Nearest Neighbor (SMOTE-ENN). Subsequently, we trained, tested, and evaluated multiple classifiers, including logistic regression (LR), decision trees (DTs), random forest (RF), gradient boosting (GB), XGBoost (XGB), CatBoost, and artificial neural networks (ANNs), comparing their performance with a primary focus on maximizing sensitivity for CVD risk prediction. Based on our findings, the hybrid resampling techniques outperformed the alternative sampling techniques, and our proposed implementation includes SMOTE-ENN coupled with CatBoost optimized through Optuna, achieving a remarkable 88% rate for recall and 82% for the area under the receiver operating characteristic (ROC) curve (AUC) metric.
Keywords