Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Huanjing Wang; Qianxin Liang; John T. Hancock; Taghi M. Khoshgoftaar

doi:10.1186/s40537-024-00905-w

Journal of Big Data (Mar 2024)

Affiliations

Huanjing Wang: Ogden College of Science and Engineering, Western Kentucky University
Qianxin Liang: College of Engineering and Computer Science, Florida Atlantic University
John T. Hancock: College of Engineering and Computer Science, Florida Atlantic University
Taghi M. Khoshgoftaar: College of Engineering and Computer Science, Florida Atlantic University

DOI: https://doi.org/10.1186/s40537-024-00905-w
Journal volume & issue: Vol. 11, no. 1, pp. 1–16

Abstract

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.
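The ranking-and-selection step described in the abstract, and the AUPRC metric used to score the resulting models, can be illustrated with a minimal, library-free sketch. This is not the authors' code. The helper names (`top_k_features`, `mean_abs_shap`, `average_precision`), the feature names `f0` to `f2`, and all numbers are hypothetical placeholders. The sketch assumes that per-sample SHAP values and a built-in importance list have already been obtained from a trained model, that SHAP values are collapsed to one global score per feature by averaging their absolute values (a common convention, not confirmed by the abstract), and that AUPRC is approximated by step-wise average precision with no tied scores.

```python
from typing import Dict, List, Sequence


def top_k_features(scores: Dict[str, float], k: int) -> List[str]:
    """Rank features by absolute score (largest first) and keep the top k."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k]


def mean_abs_shap(shap_rows: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-sample SHAP values into one global score per feature."""
    names = list(shap_rows[0].keys())
    n = len(shap_rows)
    return {name: sum(abs(row[name]) for row in shap_rows) / n for name in names}


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive label")
    # Visit samples from the highest predicted score to the lowest.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recall step
    return total / positives


# Hypothetical built-in importances and per-sample SHAP values (not real results).
builtin_importance = {"f0": 0.6, "f1": 0.1, "f2": 0.3}
shap_rows = [
    {"f0": 0.1, "f1": -0.5, "f2": 0.05},
    {"f0": -0.3, "f1": 0.1, "f2": -0.05},
]

selected_by_importance = top_k_features(builtin_importance, k=2)
selected_by_shap = top_k_features(mean_abs_shap(shap_rows), k=2)

# Hypothetical labels and model scores, used only to exercise the metric.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this toy input the two rankings select different subsets, which is the situation the study compares: each subset would be used to retrain a classifier, and the resulting AUPRC values would then be compared.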

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
