Symmetry (Sep 2022)
A Text Classification Model via Multi-Level Semantic Features
Abstract
Text classification is a core task in natural language processing (NLP) and has been a focus of research for years. News classification, a branch of text classification, involves texts that are long, structurally complex and information-dense, which lowers classification accuracy. To improve the classification accuracy of Chinese news texts, we present a text classification model based on multi-level semantic features. First, we add a category correlation coefficient to TF-IDF (Term Frequency-Inverse Document Frequency) and a frequency concentration coefficient to CHI (Chi-Square), and extract keyword semantic features with the improved algorithms. Then, we extract local semantic features with a symmetric-channel TextCNN and global semantic information with an attention-based BiLSTM. Finally, we fuse the three semantic features to predict the text category. Experiments on THUCNews, LTNews and MCNews show that the proposed method achieves accuracies of 98.01%, 90.95% and 94.24%, respectively. With two orders of magnitude fewer parameters than BERT, the model improves on the BERT+FC baseline by 1.27%, 1.2% and 2.81%, respectively.
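For readers unfamiliar with the two keyword-weighting statistics named above, the following is a minimal sketch of the standard (unmodified) TF-IDF and chi-square computations. It is not the improved algorithm: the category correlation coefficient and the frequency concentration coefficient are not defined in this abstract, so they are left out. The function names `tf_idf` and `chi_square` and the toy corpus are illustrative assumptions, and the sketch assumes non-empty, already tokenized documents.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Standard TF-IDF for each term in each document.

    docs is a list of non-empty token lists. This is the textbook
    formulation, not the improved variant described in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic between a term and a category."""
    # 2x2 contingency counts over documents:
    # a: has term, in category      b: has term, not in category
    # c: no term, in category       d: no term, not in category
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy corpus, for illustration only.
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "falls"],
    ["team", "loses", "match"],
]
labels = ["finance", "sports", "finance", "sports"]

weights = tf_idf(docs)
score = chi_square(docs, labels, "market", "finance")
```

In this toy corpus, "market" appears in every finance document and in no sports document, so its chi-square score against the finance category is the maximum possible for four documents. A term such as "rises", which occurs in only one finance document, scores lower.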
Keywords