IEEE Access (Jan 2024)
Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures
Abstract
The complexity and high dimensionality of healthcare data, which span a large number of variables such as patient demographics and medical history, present substantial challenges in building machine learning (ML) models. Effective feature selection is crucial to address issues such as increased computational resource requirements, longer training times, overfitting, and reduced model interpretability. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top-performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of the coefficient of determination ($R^{2}$), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with a reduced feature set, which could minimize the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding an $R^{2}$ score of 0.86, compared with 0.73 achieved with carefully hand-selected features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.
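For readers unfamiliar with the reported metric, the following is a minimal sketch of how the coefficient of determination is conventionally computed, $R^{2} = 1 - SS_{res}/SS_{tot}$. The function name `r2_score` and the toy expenditure values are illustrative assumptions and are not taken from the study or from MEPS data.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values (not MEPS data): actual vs. predicted expenditures.
actual = [120.0, 340.0, 80.0, 560.0]
predicted = [150.0, 300.0, 100.0, 520.0]
score = r2_score(actual, predicted)
```

Under this definition, a score of 1 means the predictions match the observed values exactly, and a score of 0 means the model does no better than always predicting the mean expenditure.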
Keywords