Real Word Spelling Error Detection and Correction for Urdu Language

Romila Aziz; Muhammad Waqas Anwar; Muhammad Hasan Jamal; Usama Ijaz Bajwa; Angel Kuc Castilla; Carlos Uc Rios; Ernesto Bautista Thompson; Imran Ashraf

doi:10.1109/access.2023.3312730

IEEE Access (Jan 2023)

Affiliations

Romila Aziz: ORCiD; Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Muhammad Waqas Anwar: ORCiD; Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Muhammad Hasan Jamal: ORCiD; Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Usama Ijaz Bajwa: ORCiD; Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Angel Kuc Castilla: Universidad Europea del Atlántico, Santander, Spain
Carlos Uc Rios: Universidad Europea del Atlántico, Santander, Spain
Ernesto Bautista Thompson: Universidad Europea del Atlántico, Santander, Spain
Imran Ashraf: ORCiD; Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, South Korea

DOI: https://doi.org/10.1109/access.2023.3312730
Journal volume & issue: Vol. 11, pp. 100948–100962

Abstract

Spelling errors generally fall into two types: non-word errors and real-word errors. Non-word errors are misspelled words that do not exist in the lexicon, while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. The lexicon-based lookup approach is widely used for non-word errors, but it cannot handle real-word errors because they require contextual information. In contrast to English, real-word error detection and correction for low-resourced languages like Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use this lexicon to develop a dataset for real-word errors comprising 125,562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84, 0.79, and 0.81, respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%.
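
The correction step described in the abstract (edit-distance candidate generation followed by n-gram ranking) might look roughly like the sketch below. It is an illustration only, not the authors' implementation: it assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, a bigram model built from raw counts, and a toy English corpus in place of the Urdu lexicon and dataset. The names `osa_distance`, `rank_candidates`, `corpus`, `lexicon`, and `bigrams` are hypothetical.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def rank_candidates(prev_word, flagged, lexicon, bigrams, max_dist=1):
    """Collect lexicon words near the flagged word, then rank them by context."""
    candidates = [
        w for w in lexicon
        if w != flagged and osa_distance(flagged, w) <= max_dist
    ]
    # Higher bigram count first; ties broken by edit distance, then alphabetically.
    return sorted(
        candidates,
        key=lambda w: (-bigrams[(prev_word, w)], osa_distance(flagged, w), w),
    )


# Toy English data (an assumption; the paper works with an Urdu lexicon and corpus).
corpus = [
    "a letter from him",
    "a letter from home",
    "fill the form",
    "the farm",
]
lexicon = {word for sentence in corpus for word in sentence.split()}
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

# "form" is a valid word but wrong in the context "a letter form him".
print(rank_candidates("letter", "form", lexicon, bigrams))
```

Here the flagged word "form" is in the lexicon, so a plain lookup would accept it; only the preceding word "letter" lets the bigram counts prefer "from" over "farm", both of which lie at distance 1.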

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
