On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction

Min-Wei Huang; Chien-Hung Chiu; Chih-Fong Tsai; Wei-Chao Lin

doi:10.3390/app11146574

Applied Sciences (Jul 2021)

Affiliations

Min-Wei Huang: Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung 406040, Taiwan
Chien-Hung Chiu: Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan
Chih-Fong Tsai: Department of Information Management, National Central University, Taoyuan 320317, Taiwan
Wei-Chao Lin: Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan

DOI: https://doi.org/10.3390/app11146574
Journal volume & issue: Vol. 11, no. 14, p. 6574

Abstract

Breast cancer prediction datasets are usually class imbalanced: the numbers of data samples in the malignant and benign patient classes differ significantly. Over-sampling techniques can be used to re-balance the datasets so that more effective prediction models can be constructed. Moreover, some related studies have applied feature selection to remove irrelevant features from the datasets for further performance improvement. However, because the order in which feature selection and over-sampling are combined produces different training sets for the prediction model, it is unknown which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are used in different combinations. The experimental results on two breast cancer datasets show that combining feature selection and over-sampling outperforms using either feature selection or over-sampling alone for the highly class imbalanced datasets. In particular, performing IG first and SMOTE second is the better choice. For other datasets with a small class imbalance ratio and a smaller number of features, performing SMOTE alone is enough to construct an effective prediction model.
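
The two orderings compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a simplified SMOTE-style interpolation, and information gain computed over equal-width bins. The toy data, the bin count, the neighbour count, and all function names (`make_toy_data`, `select_top_k`, `smote`, `fs_then_smote`, `smote_then_fs`) are assumptions made for illustration, and GA-based selection is not covered.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels, bins=4):
    """IG of one numeric feature after equal-width binning (binning is an assumption)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    binned = [min(int((v - lo) / width), bins - 1) for v in column]
    n = len(labels)
    conditional = 0.0
    for b in set(binned):
        subset = [lab for x, lab in zip(binned, labels) if x == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def select_top_k(X, y, k):
    """Return the indices of the k features with the highest information gain."""
    scores = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


def smote(X, y, minority, k=3, rng=None):
    """Simplified SMOTE for two classes: add synthetic minority rows until balanced."""
    rng = rng or random.Random(0)
    minority_rows = [row for row, lab in zip(X, y) if lab == minority]
    needed = (len(y) - len(minority_rows)) - len(minority_rows)
    X_new, y_new = [list(row) for row in X], list(y)
    for _ in range(max(needed, 0)):
        base = rng.choice(minority_rows)
        # k nearest minority neighbours of the chosen base sample
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: math.dist(r, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_new.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new


def fs_then_smote(X, y, k_features, minority):
    """Order 1: select features on the imbalanced data, then over-sample."""
    keep = select_top_k(X, y, k_features)
    X_sel = [[row[j] for j in keep] for row in X]
    return smote(X_sel, y, minority), keep


def smote_then_fs(X, y, k_features, minority):
    """Order 2: over-sample first, then select features on the balanced data."""
    X_bal, y_bal = smote(X, y, minority)
    keep = select_top_k(X_bal, y_bal, k_features)
    return ([[row[j] for j in keep] for row in X_bal], y_bal), keep


def make_toy_data(seed=0):
    """Hypothetical data: 20 majority and 5 minority rows; only feature 0 is informative."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((0, 20), (1, 5)):
        for _ in range(count):
            X.append([label * 3.0 + rng.random(), rng.random(), rng.random()])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for pipeline in (fs_then_smote, smote_then_fs):
        (X_out, y_out), keep = pipeline(X, y, 2, minority=1)
        print(pipeline.__name__, "kept features", keep, "class counts", dict(Counter(y_out)))
```

In the first ordering the feature scores are computed on the original, imbalanced samples and SMOTE interpolates only within the retained features. In the second, the scores are computed on a set that already contains synthetic minority samples. This is why the two orders can produce different training sets, as the abstract notes.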

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
