Systematic Literature Review of Dialectal Arabic: Identification and Detection

Ashraf Elnagar; Sane M. Yagi; Ali Bou Nassif; Ismail Shahin; Said A. Salloum

doi:10.1109/ACCESS.2021.3059504

IEEE Access (Jan 2021)

Affiliations

Ashraf Elnagar: Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates
Sane M. Yagi: Department of Foreign Language, University of Sharjah, Sharjah, United Arab Emirates
Ali Bou Nassif: Department of Computer Engineering, University of Sharjah, Sharjah, United Arab Emirates
Ismail Shahin: Department of Electrical Engineering, University of Sharjah, Sharjah, United Arab Emirates
Said A. Salloum: Machine Learning and NLP Research Group, University of Sharjah, Sharjah, United Arab Emirates

DOI: https://doi.org/10.1109/ACCESS.2021.3059504
Journal volume & issue: Vol. 9, pp. 31010–31042

Abstract

It is becoming increasingly difficult to know who is working on what, and how, in computational studies of Dialectal Arabic. This study charts the field through a systematic literature review intended to give insight into the most and least popular research areas, dialects, machine learning approaches, neural network input features, data types, datasets, system evaluation criteria, publication venues, and publication trends. The review follows the norms of systematic reviews and covers all research published between 2000 and 2020 that adopted a computational approach to dialectal Arabic identification and detection. It collected, analyzed, and collated this research, discovered its trends, and identified research gaps. It revealed, inter alia, that research effort has not been distributed evenly between speech and text or among the vernaculars: there is some bias favoring text over speech, regional varieties over individual vernaculars, and Egyptian over all other vernaculars. Furthermore, there is a clear preference for shallow machine learning approaches; for n-grams, TF-IDF, and MFCC as neural network features; and for accuracy as a statistical measure for validating results. The review also pointed to some glaring gaps in the research: (1) total neglect of Mauritanian and Bahraini in the continuous Arabic language area and of such enclave varieties as Anatolian Arabic, Khuzistan Arabic, Khurasan Arabic, Uzbekistan Arabic, the Sub-Saharan Arabic of Nigeria and Chad, Djibouti Arabic, Cypriot Arabic, and Maltese; (2) scarcity of city dialect resources; (3) rarity of linguistic investigations that would complement this research; and (4) paucity of deep machine learning experimentation.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
