Is Chinese Spelling Check ready? Understanding the correction behavior in real-world scenarios
Liner Yang, Xin Liu, Tianxin Liao, Zhenghao Liu, Mengyan Wang, Xuezhi Fang, Erhong Yang
Affiliations
Liner Yang
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China; Corresponding author at: National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China.
Xin Liu
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China
Tianxin Liao
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China
Zhenghao Liu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Mengyan Wang
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China
Xuezhi Fang
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China
Erhong Yang
National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing, China; School of Information Science, Beijing Language and Culture University, Beijing, China
Chinese Spelling Check (CSC) aims to detect and correct spelling errors in Chinese text. Prior work has relied predominantly on benchmarks such as SIGHAN to evaluate model performance, but these benchmarks often exhibit an imbalanced distribution of spelling errors. They are also typically constructed under idealized conditions that presume the input text contains only spelling errors. This assumption does not hold in real-world scenarios, where spell checkers frequently encounter a mix of spelling and grammatical errors, which poses additional challenges. To address this gap and create a more realistic testing environment, we introduce a high-quality CSC evaluation benchmark named YACSC (Yet Another Chinese Spelling Check Dataset). YACSC is unique in that it includes annotations for both grammatical and spelling errors, making it a more reliable benchmark for CSC. Furthermore, we propose a hierarchical network that integrates multidimensional information, drawing on the semantic and phonetic aspects of Chinese characters as well as their structural forms, to improve the detection and correction of spelling errors. Through extensive experiments, we examine the limitations of existing CSC benchmarks and illustrate the application of our proposed system in real-world scenarios, particularly as a preliminary stage in writing assistant systems.