IEEE Access (Jan 2024)
Accurate Cardiovascular Disease Prediction: Leveraging Opt_hpLGBM With Dual-Tier Feature Selection
Abstract
Reliable prediction of cardiovascular disease (CVD) outcomes is crucial for efficient patient management. Machine learning (ML) holds promise for disease prediction, but challenges arise, particularly with small clinical datasets. Feature engineering is essential in this context: it involves analyzing missing values, managing outliers, and addressing multicollinearity, and it is key to identifying and eliminating unnecessary features from the dataset. To this end, a scalable ML-based dual-tier feature selection framework called ANOVA Chi-Squared (AnoX2), built on a hybrid statistical method, is proposed. The features selected by AnoX2 are validated using five different ML classifiers. The proposed model, Opt_hpLGBM (Optuna hyperparameter-tuned Light Gradient Boosting Machine), combined with AnoX2 feature selection, achieves consistently high accuracy across four publicly available datasets: 94.87% on the Cleveland dataset with 8 features, 95.12% on the Statlog dataset with 8 features, 92.81% on the heart disease dataset with 7 features, and 98.85% on the Z-Alizadeh Sani dataset with 12 features. These results exceed current benchmarks in terms of the number of features used, accuracy, precision, recall, F1 score, and log loss. By supporting early diagnosis and treatment, the framework has the potential to significantly reduce mortality associated with cardiovascular disease.
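The abstract does not spell out how AnoX2 combines its two statistical tests, so the following is only a minimal sketch of one plausible reading: numeric features are ranked by the one-way ANOVA F statistic, categorical features by the Pearson chi-squared statistic, and the top-scoring features from each tier are kept. The function names, the `k_num` and `k_cat` parameters, the feature names, and the toy data are all hypothetical illustrations and are not taken from the paper. The Optuna tuning and LightGBM training steps are omitted.

```python
from collections import Counter, defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic of a numeric feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi2_stat(values, labels):
    """Pearson chi-squared statistic of a categorical feature vs. the class label."""
    n = len(values)
    observed = Counter(zip(values, labels))
    row_totals, col_totals = Counter(values), Counter(labels)
    stat = 0.0
    for r, r_tot in row_totals.items():
        for c, c_tot in col_totals.items():
            expected = r_tot * c_tot / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def dual_tier_select(numeric, categorical, labels, k_num, k_cat):
    """Hypothetical two-tier filter: top k_num numeric features by F,
    then top k_cat categorical features by chi-squared."""
    top_num = sorted(numeric, key=lambda f: anova_f(numeric[f], labels), reverse=True)[:k_num]
    top_cat = sorted(categorical, key=lambda f: chi2_stat(categorical[f], labels), reverse=True)[:k_cat]
    return top_num + top_cat


# Toy data (invented for illustration only, not from any of the four datasets).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
numeric = {"age": [40, 42, 41, 43, 60, 62, 61, 63], "noise": [5, 7, 6, 8, 6, 5, 8, 7]}
categorical = {"chest_pain": [0, 0, 0, 0, 1, 1, 1, 1], "sex": [0, 1, 0, 1, 0, 1, 0, 1]}

selected = dual_tier_select(numeric, categorical, labels, k_num=1, k_cat=1)
print(selected)
```

In this toy example, the features that separate the two classes ("age" and "chest_pain") score far higher than the uninformative ones, so they are the ones retained. The reduced feature set would then be passed to the downstream classifiers.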
Keywords