Alexandria Engineering Journal (Nov 2024)
Optimization of multidimensional feature engineering and data partitioning strategies in heart disease prediction models
Abstract
Heart disease is a leading global cause of death, and its rising incidence presents a significant public health challenge. Precise prediction of heart disease risk and early intervention are crucial. This study investigates how to improve the performance of heart disease prediction models built with machine learning and deep learning algorithms. We used the Heart Failure Prediction Dataset from Kaggle. After preprocessing to ensure data quality, three feature engineering techniques were applied and their impact on model performance was assessed: PCA for dimensionality reduction, ET for feature selection, and Pearson's correlation coefficient for feature selection. The dataset was then partitioned at three test-to-training split ratios (1:9, 2:8, and 3:7) to determine the effect of each ratio on model performance. Twelve machine learning classifiers (LGBM, AdaBoost, XGB, RF, DT, KNN, LR, GNB, ET, SVC, GB, and Bagging) were trained and evaluated on five key metrics: accuracy, recall, precision, F1 score, and training time. The influence of the feature engineering methods and data partitioning ratios on model performance was systematically analyzed using paired-sample t-tests. Across the feature engineering methods compared, the Bagging classifier combined with feature selection via ET exhibited superior performance. It achieved an accuracy of 97.48 % and an F1 score of 97.48 % with a 1:9 split between the test and training sets. With a 2:8 split, the accuracy was 94.96 % and the F1 score was 94.95 %. With a 3:7 split, the accuracy was 94.12 % and the F1 score was 94.11 %. The paired-sample t-test results indicate that feature selection using the Pearson correlation coefficient can shorten training time, but at the cost of a decline in classifier performance.
PCA dimensionality reduction produced no significant difference from the control group in either the training efficiency or the performance of the classifiers. In contrast, feature selection through ET significantly reduced the training time of various classifiers while maintaining their performance.
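The experimental design summarized above can be sketched in a few dozen lines. The sketch below is illustrative only and is not the authors' code: it assumes scikit-learn and SciPy, uses synthetic stand-in data from `make_classification` rather than the Kaggle dataset, covers only four of the twelve classifiers, and every hyperparameter (95 % retained variance for PCA, the default mean-importance threshold for ET selection, `k=6` for Pearson selection) is an assumption, since the abstract does not state them. The helper names `abs_pearson` and `feature_steps` are hypothetical.

```python
import time

import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data, not the Kaggle dataset used in the study.
X, y = make_classification(n_samples=900, n_features=11, n_informative=6,
                           random_state=0)


def abs_pearson(X, y):
    """Absolute Pearson correlation between each feature column and the label."""
    return np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


def feature_steps(variant):
    """Transformers for one feature engineering variant (settings are assumed)."""
    if variant == "pca":
        return [PCA(n_components=0.95, svd_solver="full")]
    if variant == "et":
        return [SelectFromModel(
            ExtraTreesClassifier(n_estimators=100, random_state=0))]
    if variant == "pearson":
        return [SelectKBest(score_func=abs_pearson, k=6)]
    return []  # control group: scaled features only


# A subset of the twelve classifiers, as zero-argument factories.
classifiers = {
    "Bagging": lambda: BaggingClassifier(random_state=0),
    "RF": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": lambda: DecisionTreeClassifier(random_state=0),
    "LR": lambda: LogisticRegression(max_iter=1000),
}
variants = ("control", "pca", "et", "pearson")
test_sizes = (0.1, 0.2, 0.3)  # the 1:9, 2:8, and 3:7 test-to-training splits

results = {}
for test_size in test_sizes:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    for variant in variants:
        # Fit scaling and feature engineering on the training split only.
        prep = make_pipeline(StandardScaler(), *feature_steps(variant))
        X_tr_fe = prep.fit_transform(X_tr, y_tr)
        X_te_fe = prep.transform(X_te)
        for name, make_clf in classifiers.items():
            clf = make_clf()
            start = time.perf_counter()
            clf.fit(X_tr_fe, y_tr)  # only classifier training is timed
            elapsed = time.perf_counter() - start
            pred = clf.predict(X_te_fe)
            results[(variant, name, test_size)] = {
                "accuracy": accuracy_score(y_te, pred),
                "f1": f1_score(y_te, pred),
                "time": elapsed,
            }

# Paired-sample t-tests: each (classifier, split) pair is measured once under
# the control condition and once under each feature engineering variant.
pairs = [(name, ts) for name in classifiers for ts in test_sizes]
tests = {}
for variant in variants[1:]:
    for metric in ("accuracy", "f1", "time"):
        control = [results[("control", n, ts)][metric] for n, ts in pairs]
        treated = [results[(variant, n, ts)][metric] for n, ts in pairs]
        tests[(variant, metric)] = ttest_rel(control, treated)
        res = tests[(variant, metric)]
        print(f"{variant:8s} {metric:8s} t={res.statistic:+.3f} p={res.pvalue:.4f}")
```

Pairing by classifier and split ratio is what makes the paired test appropriate here: each pair shares the same model and the same data split, so the test compares only the effect of the feature engineering step. Numbers produced on synthetic data say nothing about the figures reported in the abstract.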