Computer Methods and Programs in Biomedicine Update (Jan 2022)
Freely Available Arabic Corpora: A Scoping Review
Abstract
Background: Corpora play a vital role in training machine learning (ML) models and building systems that use natural language processing (NLP). Researchers often find it challenging to access corpora in languages other than English, and even more so when the corpora are not available free of cost. Arabic is the native language of over 250 million people and is used by more than 1.5 billion Muslims, as the Quran, the core text of Islam, is written in Arabic.
Objective: To highlight peer-reviewed literature reporting free and accessible Arabic corpora. We aimed to benefit researchers by providing insights into freely available and accessible Arabic corpora, allowing them to achieve their research goals with ease.
Methods: We conducted a scoping review following the PRISMA guidelines, searching the most common information technology (IT) databases to identify free and accessible Arabic corpora.
Results: We identified a total of 48 free and accessible Arabic corpus sources. We classified them by corpus type into five categories based on their primary purpose, and we present our findings by category, with direct links where available, to help readers understand the corpora.
Conclusion: Arabic is underrepresented among freely available corpora, most of which are in English. Although previous studies have searched for corpora, ours is the first of its kind: it follows the PRISMA guidelines and includes peer-reviewed articles obtained by searching the most common IT databases, as well as sources recommended by language experts.