Tehnički Vjesnik (Jan 2022)
A Study on the Features Selection Algorithm Based on the Measurement Method of the Distance Between Normal Distributions for Classification in Machine Learning
Abstract
Feature selection is an important technique that simplifies machine learning models so they are easier to understand, shortens training time, and reduces over-fitting or under-fitting. This paper presents a feature selection algorithm based on a method of measuring the similarity between sampled feature values across the classification variables (classes). It is based on the premise that the lower the similarity, the more useful the feature is for class classification. The confidence interval of a normal distribution is used to measure similarity: the more the confidence intervals of two classes overlap, the higher the similarity; the less they overlap, the lower the similarity, and a feature with low similarity can serve as a criterion for classification. Therefore, I propose an equation that applies this method. To confirm the usefulness of the equation, a colorectal cancer dataset with about 2000 genes was used, and comparative experiments were performed against other feature selection algorithms: the Gini Index (10 features), mRMR (10 features), and the relation matrix algorithm (7 features). An artificial neural network was used as the machine learning algorithm, and comparative verification was performed with leave-one-out cross-validation. In the experiment, the proposed method achieved 88.71% accuracy by selecting 10 features, better than the Gini Index (85.487%), mRMR (87.09%), and the relation matrix algorithm (87.09%). In addition, experiments on the iris, wine, glass, music emotions, seeds, and Japanese vowels datasets were conducted for multi-class classification problems. For wine, the accuracy was 98.8% when all features were used, but after six features were removed it rose to 99.4%. For music emotions, the accuracy was 51.7% when all 54 features were used, but when 20 features were removed it improved to 61.3%.
For seeds, reducing the number of features from 7 to 5 slightly improved accuracy from 93.3% to 93.8%. For iris, glass, and Japanese vowels, the accuracy did not increase even when features were removed. Therefore, it can be concluded that features can be easily and effectively selected for multi-class classification problems using the method proposed in this paper.
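The scoring idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact equation: it assumes a per-class confidence interval of the form mean ± z · (standard error), scores each feature by the total pairwise overlap of its per-class intervals, and treats smaller overlap as higher usefulness for classification. All function names are hypothetical.

```python
import math

def confidence_interval(values, z=1.96):
    """Mean +/- z * standard error of the mean (95% interval when z=1.96)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def overlap(a, b):
    """Length of the intersection of two intervals; 0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def feature_scores(X, y):
    """Score each feature by the total pairwise overlap of its per-class
    confidence intervals. Lower score = less similarity between classes,
    i.e. the feature is presumed more useful for classification."""
    classes = sorted(set(y))
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        intervals = [
            confidence_interval([x[j] for x, c in zip(X, y) if c == cls])
            for cls in classes
        ]
        total = sum(
            overlap(intervals[i], intervals[k])
            for i in range(len(intervals))
            for k in range(i + 1, len(intervals))
        )
        scores.append(total)
    return scores
```

Under this scheme, feature selection amounts to keeping the k features with the smallest scores, which mirrors the abstract's criterion that features whose class-wise confidence intervals barely overlap are the ones retained.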
Keywords