Updated Morphologically Annotated Corpora for 9 South African Languages

Tanja Gaustad; Cindy A. McKellar

doi:10.5334/johd.211

Journal of Open Humanities Data (Jun 2024)

Affiliations

Tanja Gaustad: Centre for Text Technology (CTexT), North-West University, Potchefstroom
Cindy A. McKellar: Centre for Text Technology (CTexT), North-West University, Potchefstroom

DOI: https://doi.org/10.5334/johd.211
Journal volume & issue: Vol. 10
p. 38

Abstract

The dataset described in this article presents converted and updated corpora for nine of the twelve official South African languages. After a revision of the morphological annotation protocols, the existing National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) were converted to the updated morphological tags and subsequently checked for correctness by linguistic experts. The resulting corpora are uniformly annotated for morphology across all nine languages, amounting to approximately 70,000 tokens for the five disjunctively written languages and 45,000 tokens for the four conjunctively written languages. The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies. In addition, the data can be used for language-specific and cross-language comparative corpus linguistic studies as well as corpus-based investigations of morphological phenomena in the included languages.

Published in Journal of Open Humanities Data

ISSN: 2059-481X (Online)
Publisher: Ubiquity Press
Country of publisher: United Kingdom
LCC subjects: General Works: History of scholarship and learning. The humanities; Language and Literature
Website: https://openhumanitiesdata.metajnl.com/
