Automatic construction of real‐world‐based typing‐error test dataset

Jung‐Hun Lee; Hyuk‐Chul Kwon

doi:10.1049/ell2.12515

Electronics Letters (Jul 2022)

Affiliations

Jung‐Hun Lee: The Grand ICT Research Centre, Pusan National University, Busan, South Korea
Hyuk‐Chul Kwon: Department of Information Computer Science, Pusan National University, Busan, South Korea

DOI: https://doi.org/10.1049/ell2.12515
Journal volume & issue: Vol. 58, no. 14
pp. 548 – 550

Abstract

In this study, we automatically construct a test dataset for evaluating the performance of spelling error correction systems. The Google Web 1T corpus, which includes data on 10 quadrillion phrases, is used for this purpose, so the error words in the test dataset are errors actually produced by real web users. There are seven types of error words. To obtain an error word, we search for the set of words that co-occur with the surrounding context (3‐gram range) of the position where the error word is to be generated. In this search, we exclude error words with large edit distances, which make recovering the original word exceedingly difficult. To select the final error word from the word set, the context probability is calculated using 3‐grams and a word with a high probability is selected. In the experiment, performance was measured for two systems in service (Grammarly, MS Word) and a recently announced spelling error correction system (NeuSpell). The highest F1 score, which reflects overall performance, was 56%, indicating the need for further research on spelling error correction.
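
The selection procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the context is a 3‐gram with the target word in the middle, that counts are held in an in-memory dictionary rather than read from the Web 1T files, and that the edit-distance cutoff is 2. The names `edit_distance`, `pick_error_word`, `trigram_counts` and `max_distance` are hypothetical, and the sketch does not model the seven error types or separate misspellings from valid words.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_error_word(left, original, right, trigram_counts, max_distance=2):
    """Return the likeliest in-context variant of `original`, or None.

    `trigram_counts` maps (left, word, right) tuples to frequencies.
    The cutoff `max_distance` is an assumed value, not taken from the paper.
    """
    # Words observed in the same (left, _, right) slot as the original word.
    slot = {w: c for (l, w, r), c in trigram_counts.items()
            if l == left and r == right}
    total = sum(slot.values())
    # Keep close variants only and score each by its relative frequency
    # in this context (a simple stand-in for the context probability).
    candidates = {w: c / total for w, c in slot.items()
                  if 0 < edit_distance(w, original) <= max_distance}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy counts standing in for the real corpus (illustrative values only).
toy_counts = Counter({
    ("i", "receive", "the"): 900,
    ("i", "recieve", "the"): 40,
    ("i", "recieved", "the"): 5,
    ("i", "read", "the"): 500,
})
chosen = pick_error_word("i", "receive", "the", toy_counts)
```

With these toy counts, `read` and `recieved` are discarded by the distance cutoff and `recieve` is the only remaining candidate. A real pipeline would additionally need a lexicon or similar filter to decide which candidates count as errors.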

Published in Electronics Letters

ISSN: 0013-5194 (Print); 1350-911X (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ietresearch.onlinelibrary.wiley.com/journal/1350911X
