IEEE Access (Jan 2021)
Automatic Methods and Neural Networks in Arabic Texts Diacritization: A Comprehensive Survey
Abstract
Arabic diacritics are signs used in Arabic orthography to represent essential morphophonological and syntactic information. It is common practice to omit these diacritics in written Arabic, and most Arabic electronic texts lack them. Processing such texts for Natural Language Processing purposes is therefore a complicated task. Diacritized words are necessary for applications such as machine translation, sentiment analysis, and speech synthesis. To address this problem, several studies have proposed automatic systems to restore diacritics in Arabic texts. This paper presents an in-depth survey of the 56 most recent Arabic diacritization studies. The studies are grouped into four sections according to their diacritization approach: rule-based, simple statistical, hybrid, and neural networks. Although rule-based methods such as morphological analyzers and lexicon retrieval were the earliest approaches, the results indicate that they remain valuable tools that can aid the diacritization process. Effective statistical methods that produced diacritics with acceptable accuracy include Hidden Markov Models, n-grams, and Support Vector Machines. They are often combined with either rule-based methods or neural networks in hybrid systems. Neural networks, specifically Bidirectional Long Short-Term Memory networks, reached very high diacritization accuracy levels. Studies employing neural networks focused on evaluating and comparing the efficacy of several types of neural networks or hybrids of them, testing alternative input units, or suggesting schemes for partial diacritization. This survey synthesizes the results of the studies, identifies research gaps, and offers recommendations for future research.
Keywords