DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect

Hanane Nour Moussa; Asmaa Mourhir

Data in Brief (Jun 2023)

DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect

Hanane Nour Moussa,
Asmaa Mourhir

Affiliations

Hanane Nour Moussa: Corresponding author.; School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco
Asmaa Mourhir: School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco

Journal volume & issue: Vol. 48
p. 109234

Abstract

Read online

DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to BIO scheme. 13.8% of the tokens are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia and processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community as they address the lack in dialectal Arabic annotated corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/

About the journal

Abstract

Keywords