Applied Sciences (Mar 2024)
MLSL-Spell: Chinese Spelling Check Based on Multi-Label Annotation
Abstract
Chinese spelling errors are commonplace in our daily lives, which might be caused by input methods, optical character recognition, or speech recognition. Due to Chinese characters’ phonetic and visual similarities, the Chinese spelling check (CSC) is a very challenging task. However, the existing CSC solutions cannot achieve good spelling check performance since they often fail to fully extract the contextual information and Pinyin information. In this paper, we propose a novel CSC framework based on multi-label annotation (MLSL-Spell), consisting of two basic phases: spelling detection and correction. In the spelling detection phase, MLSL-Spell uses the fusion vectors of both character-based pre-trained context vectors and Pinyin vectors and adopts the sequence labeling method to explicitly label the type of misspelled characters. In the spelling correction phase, MLSL-Spell uses Masked Language Mode (MLM) model to generate candidate characters, then performs corresponding screenings according to the error types, and finally screens out the correct characters through the XGBoost classifier. Experiments show that the MLSL-Spell model outperforms the benchmark model. On SIGHAN 2013 dataset, the spelling detection F1 score of MLSL-Spell is 18.3% higher than that of the pointer network (PN) model, and the spelling correction F1 score is 10.9% higher. On SIGHAN 2015 dataset, the spelling detection F1 score of MLSL-Spell is 11% higher than that of Bert and 15.7% higher than that of the PN model. And the spelling correction F1 of MLSL-Spell score is 6.8% higher than that of PN model.
Keywords