Electronics Letters (Jul 2022)
Automatic construction of real‐world‐based typing‐error test dataset
Abstract
In this study, we aim to automatically construct a test dataset for evaluating the performance of spelling error correction systems. The Google Web 1T corpus, which includes data on 10 quadrillion phrases, is used for this purpose, so the error words in the test dataset are ones actually produced by real web users. There are seven types of error words. To obtain candidate error words, we search for the set of words that co‐occur with the surrounding context (within a 3‐gram range) of the position where the error word is to be generated. In this step, we exclude error words whose edit distance from the original word is so large that recovering the original word becomes exceedingly difficult. To select the final error word from the word set, we compute the context probability of each candidate using 3‐grams and select a word with a high value. In the experiment, we measured the performance of two systems in service (Grammarly and MS Word) and a recently announced spelling error correction system (NeuSpell). The best overall performance was an F1 score of 56%, indicating the need for further research on spelling error correction.
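
The candidate search and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy 3‐gram counts, the edit‐distance threshold of 2, and the choice of plain Levenshtein distance are all assumptions made for illustration, and the context probability is approximated as the candidate's share of all 3‐gram counts that have the same left and right neighbours.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def pick_error_word(left, target, right, trigram_counts, max_distance=2):
    """Pick an error word for `target` from words seen between `left` and `right`.

    trigram_counts maps (left, middle, right) tuples to counts (hypothetical
    format). Candidates farther than max_distance from the target are dropped.
    Returns (word, context_probability), or None if no candidate survives.
    """
    # All middle words observed in this exact context, with their counts.
    in_context = {w: c for (l, w, r), c in trigram_counts.items()
                  if l == left and r == right}
    total = sum(in_context.values())
    # Keep only words other than the target that stay close to it.
    candidates = {w: c for w, c in in_context.items()
                  if w != target and edit_distance(w, target) <= max_distance}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / total


# Toy counts, invented for this example only.
toy_counts = {
    ("i", "have", "a"): 900,
    ("i", "hvae", "a"): 40,
    ("i", "hav", "a"): 25,
    ("i", "bought", "a"): 300,   # in context, but too far from "have"
    ("you", "hvae", "a"): 5,     # different context, ignored
}
result = pick_error_word("i", "have", "a", toy_counts)
```

Note that this sketch does not separate misspellings from valid words that happen to lie within the distance threshold. A real pipeline would need an additional filter for that, which the abstract does not specify.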