IEEE Access (Jan 2019)
Improving Term Weighting Schemes for Short Text Classification in Vector Space Model
Abstract
Short text is one of the predominant forms of communication, with unique characteristics such as short length, high sparsity, and a lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term weighting is an important pre-processing step for text classification in the vector space model. In this paper, we propose three modifications to existing state-of-the-art term weighting schemes (ifn-tp-icf, RFR, and modOR) and one new term weighting scheme (ifn-modRF). We compare the proposed schemes with ten existing unsupervised and supervised schemes on three datasets of informally written short text: a self-labelled dataset of real-world events from Twitter, a Yahoo! questions dataset, and a dataset of product reviews. Based on experimental results with three popular classifiers, we observe that the proposed scheme ifn-modRF achieves the best F1-scores on the Twitter dataset, while the proposed modification modOR is a consistent performer, with the best scores in most of the experiments. The proposed modification ifn-tp-icf also outperforms the original scheme in most experiments.
Keywords