IEEE Access (Jan 2023)

Six-Granularity Based Chinese Short Text Classification

  • Xinjie Sun,
  • Zhifang Liu,
  • Xingying Huo

DOI
https://doi.org/10.1109/ACCESS.2023.3265712
Journal volume & issue
Vol. 11
pp. 35841–35852

Abstract


Short text classification is an important task in Natural Language Processing (NLP). Classification results for Chinese short texts are often unsatisfactory because of their data sparsity. Most previous Chinese short text classification models are based on words or characters alone; since Chinese radicals can also carry meaning on their own, this paper builds a classification model from words, characters and radicals together, which alleviates the data sparsity problem of short texts. In addition, when segmenting sentences into words, jieba can lose key information while n-gram segmentation generates noise words, so both jieba and n-gram are used to construct a six-granularity (i.e., word-jieba, word-jieba-radical, word-ngram, word-ngram-radical, character and character-radical) based Chinese short text classification (SGCSTC) model. Because the six granularities influence the classification result differently, each is assigned its own weight, and these weights are updated automatically during back-propagation with the cross-entropy loss. On the THUCNews-S dataset, SGCSTC achieves a classification Accuracy, Precision, Recall and F1 of 93.36%, 94.47%, 94.15% and 94.31%, respectively, and on the CNT dataset 92.67%, 92.38%, 93.15% and 92.76%, respectively; multiple comparative experiments on THUCNews-S and CNT show that SGCSTC outperforms state-of-the-art text classification models.
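To make the weighted six-granularity fusion concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: one representation per granularity, combined through learnable weights that the cross-entropy loss updates via back-propagation. It is not the authors' implementation; the mean-pooled embedding encoder, the hidden size, and the softmax normalization of the weights are assumptions made only for illustration.

```python
# Minimal sketch of weighted six-granularity fusion (not the authors' code).
# Assumptions: mean-pooled embedding encoders, dim=128, softmax-normalized weights.
import torch
import torch.nn as nn

GRANULARITIES = [
    "word-jieba", "word-jieba-radical",
    "word-ngram", "word-ngram-radical",
    "character", "character-radical",
]

class GranularityEncoder(nn.Module):
    """Embeds one token-id sequence and mean-pools it into a fixed vector."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:    # ids: (batch, seq_len)
        mask = (ids != 0).float().unsqueeze(-1)               # ignore padding
        emb = self.embed(ids) * mask
        return emb.sum(1) / mask.sum(1).clamp(min=1.0)        # (batch, dim)

class SixGranularityClassifier(nn.Module):
    """One encoder per granularity plus learnable fusion weights.

    The weights are ordinary parameters, so the cross-entropy loss updates
    them through back-propagation, as the abstract describes.
    """
    def __init__(self, vocab_sizes: dict, num_classes: int, dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {g: GranularityEncoder(vocab_sizes[g], dim) for g in GRANULARITIES}
        )
        self.weights = nn.Parameter(torch.zeros(len(GRANULARITIES)))  # learnable
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        reps = torch.stack(
            [self.encoders[g](inputs[g]) for g in GRANULARITIES], dim=1
        )                                                     # (batch, 6, dim)
        alpha = torch.softmax(self.weights, dim=0)            # normalized weights
        fused = (alpha.view(1, -1, 1) * reps).sum(dim=1)      # weighted sum
        return self.classifier(fused)

# Usage sketch: six padded id tensors (one per granularity) for a batch of 2 texts.
vocab_sizes = {g: 5000 for g in GRANULARITIES}                # placeholder sizes
model = SixGranularityClassifier(vocab_sizes, num_classes=10)
batch = {g: torch.randint(1, 5000, (2, 16)) for g in GRANULARITIES}
labels = torch.tensor([3, 7])
loss = nn.CrossEntropyLoss()(model(batch), labels)            # also trains the fusion weights
loss.backward()
```

In this reading, the six input id sequences would come from the two segmenters (jieba and n-gram) and the character stream, each optionally expanded with the radicals of its tokens; how the paper encodes each stream and normalizes the weights may differ from this sketch.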

Keywords