Jordanian Journal of Computers and Information Technology (Mar 2024)

ARABIC SOFT SPELLING CORRECTION WITH T5

  • Ola Arif Jaafar
  • Mohammed Al-Qaraghuli

DOI
https://doi.org/10.5455/jjcit.71-1699768515
Journal volume & issue
Vol. 10, no. 1
pp. 46 – 57

Abstract

Spelling correction is a challenging task for resource-scarce languages. Arabic is one such language: it lacks a large spelling correction dataset, so datasets injected with artificial errors are used to overcome this problem. In this paper, we train the Text-to-Text Transfer Transformer (T5) model on artificial errors to correct Arabic soft spelling mistakes. Our T5 model corrects 97.8% of the artificial errors injected into the test set and achieves a character error rate (CER) of 0.77% on a set that contains real soft spelling mistakes. We obtained these results with a 4-layer T5 model trained with a 90% error injection rate and a maximum sequence length of 300 characters. [JJCIT 2024; 10(1): 46-57]
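The two quantities named in the abstract, the error injection rate and the character error rate, can be illustrated with a minimal sketch. It is not the authors' implementation. The confusion groups below (alef forms, ta marbuta versus ha, alef maqsura versus ya) are an assumption about what counts as a soft spelling mistake, and the rate is assumed here to be a per-candidate-character probability, which may differ from the definition used in the paper. The function names `inject_errors`, `levenshtein` and `cer` are hypothetical. CER is computed in the usual way, as the character-level edit distance divided by the reference length.

```python
import random

# Assumed confusion groups for soft spelling errors (illustrative only,
# not taken from the paper). Written as escapes to avoid bidi display issues.
CONFUSION_GROUPS = [
    "\u0627\u0623\u0625\u0622",  # alef, alef+hamza above, alef+hamza below, alef+madda
    "\u0629\u0647",              # ta marbuta vs. ha
    "\u0649\u064A",              # alef maqsura vs. ya
]
# Map each character to the group it belongs to.
CHAR_TO_GROUP = {ch: group for group in CONFUSION_GROUPS for ch in group}


def inject_errors(text, rate, rng=None):
    """Replace each confusable character with another member of its group
    with probability `rate` (assumed per-character definition)."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        group = CHAR_TO_GROUP.get(ch)
        if group is not None and rng.random() < rate:
            # Pick a different character from the same confusion group.
            out.append(rng.choice([c for c in group if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    clean = "\u0645\u062F\u0631\u0633\u0629"  # a five-letter word ending in ta marbuta
    noisy = inject_errors(clean, rate=0.9, rng=random.Random(0))
    print(noisy, cer(clean, noisy))
```

In a training setup of the kind the abstract describes, the noisy string would presumably serve as the model input and the clean string as the target, with `cer` used to score model output against the clean reference.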

Keywords