Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields

Abdulmohsen Al-Thubaity; Atheer Alkhalifa; Abdulrahman Almuhareb; Waleed Alsanie

doi:10.1109/ACCESS.2020.3018885

IEEE Access (Jan 2020)


Affiliations

Abdulmohsen Al-Thubaity: National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia
Atheer Alkhalifa: National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia
Abdulrahman Almuhareb: National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia
Waleed Alsanie: National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia

Journal volume & issue: Vol. 8, pp. 154984–154996

Abstract


Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This article discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionary, or feature engineering, but rather uses a sequence-to-sequence schema. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.
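
The input/output schema and the two error rates mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not include the BiLSTM-CRF model itself. The function names, the diacritic set (U+064B to U+0652, excluding the superscript alef U+0670), and the exact counting rules (every non-space character counts toward the diacritic error rate; a word is wrong if any of its characters is mislabeled) are assumptions made for illustration and may differ from the definitions used in the paper.

```python
# Assumed diacritic set: fathatan .. sukun (U+064B-U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split diacritized text into base characters and per-character labels.

    Each label is the (possibly empty) string of diacritics that follow
    the base character, sorted so that mark order does not matter.
    """
    chars, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if chars:  # ignore a stray diacritic with no base character
                labels[-1] += ch
            continue
        chars.append(ch)
        labels.append("")
    return chars, ["".join(sorted(label)) for label in labels]


def error_rates(reference, predicted):
    """Return (diacritic error rate, word error rate) as fractions."""
    ref_chars, ref_labels = split_diacritics(reference)
    hyp_chars, hyp_labels = split_diacritics(predicted)
    if ref_chars != hyp_chars:
        raise ValueError("reference and prediction differ in base characters")

    char_total = char_errors = word_total = word_errors = 0
    in_word = word_wrong = False
    # A trailing space is appended so the last word is flushed by the loop.
    for ch, ref, hyp in zip(ref_chars + [" "], ref_labels + [""], hyp_labels + [""]):
        if ch.isspace():
            if in_word:
                word_total += 1
                word_errors += int(word_wrong)
            in_word = word_wrong = False
            continue
        in_word = True
        char_total += 1
        if ref != hyp:
            char_errors += 1
            word_wrong = True

    if char_total == 0:
        raise ValueError("no characters to evaluate")
    return char_errors / char_total, word_errors / word_total


if __name__ == "__main__":
    # Two-word example; the prediction puts a damma where the reference
    # has a fatha on the first letter.
    reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u062F\u064E\u0631\u0652\u0633"
    predicted = "\u0643\u064F\u062A\u064E\u0628\u064E \u062F\u064E\u0631\u0652\u0633"
    der, wer = error_rates(reference, predicted)
    print(f"DER = {der:.2%}, WER = {wer:.2%}")
```

In this toy example one of six characters and one of two words are mislabeled. A sequence model of the kind described above would be trained to predict the label list from the character list.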

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
