An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus

Hebah Elgibreen; Mohammed Faisal; Mansour Al Sulaiman; Sherif Abdou; Mohamed Amine Mekhtiche; Abdullah M. Moussa; Yousef A. Alohali; Wadood Abdul; Ghulam Muhammad; Mohsen Rashwan; Mohammed Algabri

doi:10.1109/ACCESS.2021.3089924

IEEE Access (Jan 2021)

Affiliations

Hebah Elgibreen: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mohammed Faisal: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mansour Al Sulaiman: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Sherif Abdou: Department of Information Technology, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Mohamed Amine Mekhtiche: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Abdullah M. Moussa: Department of Information Technology, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Yousef A. Alohali: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Wadood Abdul: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Ghulam Muhammad: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mohsen Rashwan: Department of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, Giza, Egypt
Mohammed Algabri: Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2021.3089924
Journal volume & issue: Vol. 9
pp. 88405 – 88428

Abstract

Driven by rapid technological development and the sudden expansion of social media use, Dialectal Arabic has become an important source of data that must be addressed when building Arabic corpora. In this paper, thirty-three Arabic corpora are surveyed to show that, despite the progress reported in the literature, Saudi dialect (SD) corpora still need further expansion. This paper contributes to the literature on SD corpora by creating the largest Saudi corpus, the King Saud University Saudi Corpus (KSUSC), with more than 1 billion total words, including more than 119 million SD words. The KSUSC is not only the newest and largest SD corpus but is also diverse, covering 26 domains in text collected from five different sources. This paper also contributes a new incremental preprocessing system that creates relevant lexicons, which are then used to clean and normalize the collected data. This incremental system is scalable and can be adapted to different resources and dialects. Moreover, the collection process for building the KSUSC is discussed in detail, and the challenges of collecting SD text on each platform are highlighted. Finally, design criteria are proposed and applied to the KSUSC, leading to the conclusion that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus with their own work or using its lexicons in Saudi-based NLP tasks.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
