IEEE Access (Jan 2021)

Empirical Comparison of the Feature Evaluation Methods Based on Statistical Measures

  • Adam Lysiak,
  • Miroslaw Szmajda

DOI
https://doi.org/10.1109/ACCESS.2021.3058428
Journal volume & issue
Vol. 9
pp. 27868 – 27883

Abstract

One of the most important classification problems is selecting proper features, i.e. features that describe the classified object in the most straightforward way possible. One of the biggest challenges of feature selection is evaluating the quality of a feature, and there is a plethora of feature evaluation methods in the literature. This paper presents the results of a comparison between nine selected feature evaluation methods, both existing in the literature and newly defined. For the comparison, features from ten different data sets were evaluated by every method. Then, from every feature set, the best subset (according to each method) was chosen. Those subsets were then used to train a set of classifiers (including decision trees and forests, linear discriminant analysis, naive Bayes, support vector machines, k-nearest neighbors, and an artificial neural network). The maximum accuracy of those classifiers, as well as the standard deviation between their accuracies, were used as quality measures of each method. Furthermore, it was determined which method is the most universal with respect to the data set, i.e. for which method the obtained accuracies depended least on the feature set. Finally, the computation time of each method was compared. Results indicated that, for applications with limited computational power, the method based on the average overlap between feature values seems best suited: it led to high accuracies and proved fast to compute. However, if the data set is known to be normally distributed, the method based on the two-sample t-test may be preferable.
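To make the workflow concrete, the following Python sketch illustrates the general idea: score each feature with a two-sample t-test and with an overlap heuristic, keep the top-ranked features, then train several classifiers and report the maximum accuracy and its spread. The overlap_score here is an illustrative heuristic and load_breast_cancer a stand-in data set; neither is the authors' exact definition or experimental setup.

# Illustrative sketch, not the paper's implementation: rank features with two
# statistical scores, select the best subset, and evaluate several classifiers.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # binary stand-in data set
cls0, cls1 = X[y == 0], X[y == 1]

def t_test_score(a, b):
    """Absolute two-sample t statistic: larger means better class separation."""
    t, _ = ttest_ind(a, b, equal_var=False)
    return abs(t)

def overlap_score(a, b):
    """Hypothetical overlap measure: fraction of one class's values falling inside
    the other class's range, averaged over both directions. Negated so that a
    larger score still means a better (less overlapping) feature."""
    frac_a = np.mean((a >= b.min()) & (a <= b.max()))
    frac_b = np.mean((b >= a.min()) & (b <= a.max()))
    return -(frac_a + frac_b) / 2.0

def select_top_k(score_fn, k=5):
    """Indices of the k best features according to score_fn."""
    scores = [score_fn(cls0[:, j], cls1[:, j]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

for name, score_fn in [("t-test", t_test_score), ("overlap", overlap_score)]:
    idx = select_top_k(score_fn)
    Xtr, Xte, ytr, yte = train_test_split(X[:, idx], y, random_state=0)
    accs = [clf.fit(Xtr, ytr).score(Xte, yte)
            for clf in (DecisionTreeClassifier(), KNeighborsClassifier(),
                        GaussianNB(), SVC())]
    print(f"{name}: max accuracy = {max(accs):.3f}, std = {np.std(accs):.3f}")

As in the paper's protocol, the maximum accuracy over the classifier pool and the standard deviation of the accuracies serve as the quality measures of each feature evaluation method.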

Keywords