MLSL-Spell: Chinese Spelling Check Based on Multi-Label Annotation

Liming Jiang; Xingfa Shen; Qingbiao Zhao; Jian Yao

doi:10.3390/app14062541

Applied Sciences (Mar 2024)

Affiliations

Liming Jiang: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Xingfa Shen: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Qingbiao Zhao: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Jian Yao: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China

DOI: https://doi.org/10.3390/app14062541
Journal volume & issue: Vol. 14, no. 6, p. 2541

Abstract

Chinese spelling errors are commonplace in daily life and may be caused by input methods, optical character recognition, or speech recognition. Because many Chinese characters are phonetically or visually similar, Chinese spelling check (CSC) is a very challenging task. Existing CSC solutions often fail to fully extract contextual information and Pinyin information, and therefore cannot achieve good spelling check performance. In this paper, we propose a novel CSC framework based on multi-label annotation (MLSL-Spell), consisting of two basic phases: spelling detection and spelling correction. In the spelling detection phase, MLSL-Spell fuses character-based pre-trained context vectors with Pinyin vectors and adopts a sequence labeling method to explicitly label the error type of each misspelled character. In the spelling correction phase, MLSL-Spell uses a Masked Language Model (MLM) to generate candidate characters, screens the candidates according to the error type, and finally selects the correct character with an XGBoost classifier. Experiments show that MLSL-Spell outperforms the benchmark models. On the SIGHAN 2013 dataset, the spelling detection F1 score of MLSL-Spell is 18.3% higher than that of the pointer network (PN) model, and the spelling correction F1 score is 10.9% higher. On the SIGHAN 2015 dataset, the spelling detection F1 score of MLSL-Spell is 11% higher than that of BERT and 15.7% higher than that of the PN model, and the spelling correction F1 score is 6.8% higher than that of the PN model.
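
The error-type screening step of the correction phase can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not list the label set, so the two error types used here (phonetic and visual), the label names, the toy Pinyin table, the toy shape-confusion table, and the function `screen_candidates` are all hypothetical. The MLM candidate generation and the final XGBoost selection are left out, and the candidate lists are supplied by hand.

```python
from typing import Dict, List, Set

# Hypothetical label names; the abstract does not specify the actual label set.
LABEL_OK = "O"        # character judged correct
LABEL_PHONETIC = "P"  # misspelling assumed to be caused by similar pronunciation
LABEL_VISUAL = "V"    # misspelling assumed to be caused by similar shape

# Toy toneless Pinyin lookup (assumption: a real system would use a full lexicon).
TOY_PINYIN: Dict[str, str] = {
    "平": "ping", "苹": "ping", "评": "ping", "水": "shui", "果": "guo",
}

# Toy shape-confusion sets (assumption: a real system would use a confusion set resource).
TOY_SHAPE: Dict[str, Set[str]] = {
    "己": {"已", "巳"},
}


def screen_candidates(original: str, label: str, candidates: List[str]) -> List[str]:
    """Keep only the candidates that are consistent with the detected error type."""
    if label == LABEL_PHONETIC:
        target = TOY_PINYIN.get(original)
        if target is None:
            return []
        # Phonetic error: keep candidates that share the original's Pinyin.
        return [c for c in candidates
                if c != original and TOY_PINYIN.get(c) == target]
    if label == LABEL_VISUAL:
        similar = TOY_SHAPE.get(original, set())
        # Visual error: keep candidates that look like the original character.
        return [c for c in candidates if c in similar]
    # Character labeled correct (or unknown label): nothing to replace.
    return []


if __name__ == "__main__":
    # Hand-written stand-in for MLM candidates at a position labeled as a phonetic error.
    print(screen_candidates("平", LABEL_PHONETIC, ["苹", "水", "评", "平"]))
```

In this sketch the screened list would then be passed to a classifier that picks one character; the abstract states that MLSL-Spell uses XGBoost for that final choice.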

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
