Informatics in Medicine Unlocked (Jan 2022)

Advanced hybrid ensemble gain ratio feature selection model using machine learning for enhanced disease risk prediction

  • Syed Javeed Pasha,
  • E. Syed Mohamed

Journal volume & issue
Vol. 32
p. 101064

Abstract

There is a growing need for machine learning (ML) and data mining in the healthcare domain, where their applications play a pivotal role in extracting useful knowledge from the available data. Existing disease risk prediction models either do not use feature selection or rely on traditional feature selection models (filter, wrapper, and embedded), which have limitations such as not using ML, relying on a single evaluation metric, and working on imbalanced datasets. Beyond these limitations, there is considerable scope for improving prediction performance. To address these issues, an advanced hybrid ensemble gain ratio feature selection (AHEG-FS) model is proposed. It consists of four major feature selection techniques: ensemble feature selection, gain ratio feature selection, backward feature elimination, and the area under the curve (AUC), which serves alongside accuracy as an additional evaluation metric in the novel feature reduction, for robust feature selection aimed at effective disease risk prediction. The first two techniques yield subsets of important, highly ranked features. The proposed model is then aligned with nine ML algorithms. In the third and fourth techniques, the AUCs are evaluated for these nine ML algorithms and backward feature elimination is applied to remove redundant features, yielding the best subsets of highly contributing features that produce the highest precision results. Four benchmark heart disease datasets from the University of California, Irvine (UCI) ML repository (Cleveland, Hungarian, Statlog, and Switzerland) are used, and the results are encouraging. The highest AUC and accuracy achieved are 99.00% and 95.47%, respectively, with 46.15% of the features removed. Accuracy is 6.18% higher than in recent studies and is achieved with convergent speed.
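
The gain ratio ranking named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it uses the standard textbook definition (information gain divided by the split information of the feature), assumes the features are already discrete, and the function names `entropy`, `gain_ratio`, and `rank_features`, as well as the toy data, are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalised by the
    feature's own entropy (split information). Standard definition only;
    continuous features would need to be discretised first."""
    n = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return feature names ordered from highest to lowest gain ratio."""
    return sorted(columns, key=lambda name: gain_ratio(columns[name], labels),
                  reverse=True)


# Hypothetical toy data: one informative, one uninformative, one constant column.
toy_labels = [1, 1, 0, 0]
toy_columns = {
    "informative": [1, 1, 0, 0],
    "uninformative": [1, 0, 1, 0],
    "constant": [0, 0, 0, 0],
}
ranking = rank_features(toy_columns, toy_labels)
```

In a pipeline like the one the abstract describes, a ranking of this kind would only supply candidate features; the backward elimination step would then drop features one at a time while monitoring AUC and accuracy of the chosen classifier.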

Keywords