Correcting spelling mistakes in Persian texts with rules and deep learning methods

Sa. Kasmaiee; Si. Kasmaiee; M. Homayounpour

doi:10.1038/s41598-023-47295-2

Scientific Reports (Nov 2023)

Affiliations

Sa. Kasmaiee: Department of Computer Engineering, Amirkabir University of Technology
Si. Kasmaiee: Department of Computer Engineering, Amirkabir University of Technology
M. Homayounpour: Department of Computer Engineering, Amirkabir University of Technology

DOI: https://doi.org/10.1038/s41598-023-47295-2
Journal volume & issue: Vol. 13, no. 1
pp. 1 – 21

Abstract

This study develops a system for automatically correcting spelling errors in Persian texts using two approaches: one that relies on rules and a list of common spelling mistakes, and another that uses a deep neural network. For the rule-based approach, a list of 700 common misspellings was compiled, and a database of 55,000 common Persian words was used to identify spelling errors. A total of 112 rules were implemented for spelling correction, each providing suggested words for a misspelled word. The approach was evaluated on 2,500 sentences, with the suggested word having the shortest Levenshtein distance selected for evaluation. For the deep learning approach, the base network was a deep encoder-decoder built on long short-term memory (LSTM) with a word embedding layer, for which FastText was chosen. The base network was then extended with convolutional and capsule layers. A database of 1.2 million sentences was created, with 800,000 for training, 200,000 for testing, and 200,000 for evaluation. The results showed that the network's performance with capsule and convolutional layers was similar to that of the base network. The network performed well in evaluation, achieving accuracy, precision, recall, F-measure, and bilingual evaluation understudy (BLEU) scores of 87%, 70%, 89%, 78%, and 84%, respectively.
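The selection step in the rule-based approach, picking the suggestion with the shortest Levenshtein distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `levenshtein` and `best_suggestion`, the toy Latin-script words, and the tie-breaking behaviour are assumptions made for illustration, and the candidate list stands in for whatever the 112 correction rules would propose.

```python
from typing import Iterable, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_suggestion(word: str, lexicon: Set[str], candidates: Iterable[str]) -> str:
    """Return the in-lexicon candidate closest to `word` (hypothetical helper).

    `lexicon` plays the role of the word database used to detect errors;
    `candidates` plays the role of the suggestions produced by the rules.
    """
    if word in lexicon:
        return word  # not flagged as a spelling error
    valid = [c for c in candidates if c in lexicon]
    if not valid:
        return word  # nothing to suggest; leave the word unchanged
    # Assumption: ties are broken by candidate order (min keeps the first minimum).
    return min(valid, key=lambda c: levenshtein(word, c))


if __name__ == "__main__":
    toy_lexicon = {"spelling", "spilling", "correct"}
    print(best_suggestion("speling", toy_lexicon, ["spelling", "spilling"]))
```

In this sketch a word is treated as misspelled only when it is absent from the lexicon, which mirrors the abstract's description of using a word database to identify errors; how the paper handles real-word errors or ties between equally distant suggestions is not stated in the abstract.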

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
