Adapting Feature Selection Algorithms for the Classification of Chinese Texts

Xuan Liu; Shuang Wang; Siyu Lu; Zhengtong Yin; Xiaolu Li; Lirong Yin; Jiawei Tian; Wenfeng Zheng

doi:10.3390/systems11090483

Systems (Sep 2023)

Affiliations

Xuan Liu: School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China
Shuang Wang: School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
Siyu Lu: School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
Zhengtong Yin: College of Resource and Environment Engineering, Guizhou University, Guiyang 550025, China
Xiaolu Li: School of Geographic Science, Southwest University, Chongqing 400715, China
Lirong Yin: Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, USA
Jiawei Tian: School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
Wenfeng Zheng: School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China

Journal volume & issue: Vol. 11, no. 9, p. 483

Abstract

Text classification is a key process for organizing online texts and improving communication in the digital media age. Because classification rules are built on text features, the accuracy of feature selection underpins the accuracy of classification. As the volume of Chinese electronic documents has grown rapidly, scholars have proposed many feature selection algorithms for the automatic classification of Chinese texts in recent years. However, how to adapt existing feature selection algorithms to different types of Chinese texts remains insufficiently discussed. To address this gap, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts: CHMI, a chi-square (CHI) algorithm enhanced with mutual information (MI) that simultaneously introduces word frequency and term adjustment; TF–CHI, a term frequency–chi-square algorithm that enhances weight calculation; and TF–XGBoost, a term frequency–inverse document frequency (TF–IDF) algorithm enhanced with extreme gradient boosting (XGBoost) that improves word filtering. The study randomly selects 3000 texts from six categories of the Sogou news corpus, obtains the confusion matrix, and evaluates the new algorithms using precision and the F1-score. Experimental comparisons are conducted with support vector machine (SVM) and naive Bayes (NB) classifiers. The results show that the proposed feature selection algorithms improve performance across the various news corpora, although the best feature selection scheme differs for each type of corpus. Further studies are suggested on applying the improved feature selection methods to other languages and on improving the classifiers.
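
To make the terms in the abstract concrete, the following is a minimal sketch of classical chi-square feature selection and of per-class precision and F1 computed from a confusion matrix. It illustrates only the standard baseline, not the paper's CHMI, TF–CHI, or TF–XGBoost variants, whose formulas are not given here. The function names and the toy documents are hypothetical, and the ASCII tokens stand in for the output of a Chinese word segmenter.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories.

    docs is a list of token lists (assumed already segmented); ties are
    broken alphabetically so the result is deterministic.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()        # term -> number of documents containing it
    class_doc_freq = Counter()  # (term, label) -> number of such documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for label, size in class_sizes.items():
            n11 = class_doc_freq[(term, label)]
            n10 = df - n11
            n01 = size - n11
            n00 = n_docs - df - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def precision_recall_f1(confusion, cls):
    """Per-class metrics; confusion[i][j] counts true class i predicted as j."""
    tp = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)
    actual = sum(confusion[cls])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy corpus (hypothetical): two finance and two sports documents.
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund", "market"],
    ["team", "match", "win"],
    ["team", "player", "match"],
]
labels = ["finance", "finance", "sports", "sports"]
top_terms = select_features(docs, labels, 4)

# Toy 2-class confusion matrix (hypothetical counts, not results from the paper).
confusion = [[8, 2], [1, 9]]
precision, recall, f1 = precision_recall_f1(confusion, 0)
```

In this sketch a term's score is its maximum chi-square value over the categories, which is one common convention; averaging over categories is another. Terms that occur in every document of exactly one category score highest, which is the behavior that frequency-aware variants aim to refine.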

Published in Systems

ISSN: 2079-8954 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General): Systems engineering; Technology: Technology (General)
Website: http://www.mdpi.com/journal/systems
