The Feature Selection Effect on Missing Value Imputation of Medical Datasets

Chia-Hui Liu; Chih-Fong Tsai; Kuen-Liang Sue; Min-Wei Huang

doi:10.3390/app10072344

Applied Sciences (Mar 2020)

Affiliations

Chia-Hui Liu: Department of Nursing, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi 60002, Taiwan
Chih-Fong Tsai: Department of Information Management, National Central University, Taoyuan 320, Taiwan
Kuen-Liang Sue: Department of Information Management, National Central University, Taoyuan 320, Taiwan
Min-Wei Huang: School of Medicine, China Medical University, Taichung 404, Taiwan

DOI: https://doi.org/10.3390/app10072344
Journal volume & issue: Vol. 10, no. 7, p. 2344

Abstract

In practice, many medical datasets are incomplete: a proportion of their records have missing attribute values. Missing value imputation can be performed to address this problem. To impute missing values, some of the observed data (i.e., complete data) are generally used as the reference or training set, and statistical and machine learning techniques are then employed to produce estimates that replace the missing values. Since a collected dataset usually contains a certain number of feature dimensions, performing feature selection is useful for better pattern recognition. The aim of this paper is therefore to examine the effect of performing feature selection on missing value imputation of medical datasets. Experiments are carried out on five different medical domain datasets containing various feature dimensions. In addition, three different types of feature selection methods and imputation techniques are employed for comparison. The results show that combining feature selection and imputation is a better choice for many medical datasets. However, the feature selection algorithm should be chosen carefully in order to produce the best result. In particular, the genetic algorithm and information gain models are suitable for lower-dimensional datasets, whereas the decision tree model is a better choice for higher-dimensional datasets.
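The combination described in the abstract (select features using the complete records, then impute missing values over the reduced feature set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy data, the function names, the median split used to compute information gain on numeric features, and the choice of k-nearest-neighbour imputation, which the abstract does not name.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one numeric feature, binarized at its
    median (the median split is an assumption of this sketch)."""
    threshold = sorted(values)[len(values) // 2]
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in (left, right) if p)
    return entropy(labels) - remainder


def select_features(rows, labels, k):
    """Rank features by information gain on the complete rows only
    and return the indices of the top k, in ascending order."""
    complete = [(r, y) for r, y in zip(rows, labels) if None not in r]
    ys = [y for _, y in complete]
    scores = [
        information_gain([r[j] for r, _ in complete], ys)
        for j in range(len(rows[0]))
    ]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def knn_impute(rows, n_neighbors=2):
    """Fill each missing value (None) with the mean of that feature over
    the nearest complete rows, measured on the row's observed features."""
    reference = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        observed = [j for j, v in enumerate(row) if v is not None]
        nearest = sorted(
            reference,
            key=lambda ref: sum((ref[j] - row[j]) ** 2 for j in observed),
        )[:n_neighbors]
        imputed.append([
            v if v is not None else sum(ref[j] for ref in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed


# Hypothetical toy data: three features, the third is uninformative noise.
rows = [
    [1.0, 10.0, 5.0],
    [2.0, 11.0, 3.0],
    [8.0, 20.0, 4.0],
    [9.0, 21.0, 6.0],
    [None, 10.2, 5.5],  # incomplete record
    [8.5, None, 3.5],   # incomplete record
]
labels = [0, 0, 1, 1, 0, 1]

selected = select_features(rows, labels, k=2)
reduced = [[row[j] for j in selected] for row in rows]
imputed = knn_impute(reduced, n_neighbors=2)
```

In this sketch the noisy third feature is discarded before imputation, so the nearest neighbours of each incomplete record are determined only by the features that carry class information. Whether this ordering helps on a real dataset depends on the feature selection method and the dimensionality, which is the question the paper examines.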

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
