IEEE Access (Jan 2024)
A New Filter Feature Selection Method for Text Classification
Abstract
Massive amounts of text data are being created on the Internet owing to the widespread use of platforms such as social media. Text classification is one of the most frequently used techniques for extracting useful information from such data. One of its most fundamental problems is high dimensionality, which greatly reduces the success of classifiers while increasing their computational cost. The most effective way to overcome this problem is to use a feature selector to choose a subset containing the most distinctive features from the entire feature space. This study presents a new filter feature selection approach for text classification called Multivariate Feature Selector (MFS). The proposed approach calculates a score for each feature based on three knowledge structures: class-based, document-based, and document-class-based. These structures are used to reveal hidden information at the class, document, and document-class levels, which enables a more precise and effective score calculation for each term. MFS was tested on four different datasets, with micro-F1 and macro-F1 used as performance measures to evaluate its success in feature selection. MFS was observed to outperform the main feature selection methods in the literature. Although the classification results varied with the selected feature size, MFS showed superior performance in all selected feature subspaces.
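To make the evaluation setting concrete, the following minimal Python sketch illustrates the general shape of a filter feature selector (score every term independently of any classifier, then keep the top k) together with the micro-F1 and macro-F1 measures. It is not the MFS scoring function, which the abstract does not specify. The names `select_top_k`, `class_df_spread`, and `f1_scores` are hypothetical, and `class_df_spread` is only a stand-in score assumed here for illustration.

```python
from collections import Counter


def select_top_k(docs, labels, k, score_fn):
    """Generic filter selection: score each term, keep the k highest-scoring ones."""
    vocab = sorted({term for doc in docs for term in doc})
    scores = {term: score_fn(term, docs, labels) for term in vocab}
    # Descending score; ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda term: (-scores[term], term))[:k]


def class_df_spread(term, docs, labels):
    """Stand-in score (NOT MFS): spread of the term's per-class document-frequency rate."""
    totals = Counter(labels)
    hits = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    rates = [hits[c] / totals[c] for c in totals]
    return max(rates) - min(rates)


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro-F1: F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data: each document is a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "match"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
selected = select_top_k(docs, labels, k=2, score_fn=class_df_spread)
micro, macro = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

In this toy example, the term "match" occurs equally often in both classes and therefore receives the lowest stand-in score. Reporting both measures is informative because micro-F1 is dominated by frequent classes, whereas macro-F1 weights every class equally.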
Keywords