Journal of Open Humanities Data (Jun 2024)

Updated Morphologically Annotated Corpora for 9 South African Languages

  • Tanja Gaustad,
  • Cindy A. McKellar

DOI
https://doi.org/10.5334/johd.211
Journal volume & issue
Vol. 10
pp. 38 – 38

Abstract

The dataset described in this article presents converted and updated corpora for nine of the twelve official South African languages. After a revision of the morphological annotation protocols, the existing National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) were converted to the updated morphological tags and subsequently checked for correctness by linguistic experts. The resulting corpora are uniformly annotated for morphology across all nine languages, amounting to approximately 70,000 tokens for the five disjunctively written languages and 45,000 tokens for the four conjunctively written languages. The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies. In addition, the data can be used for language-specific and cross-language comparative corpus linguistic studies as well as corpus-based investigations of morphological phenomena in the included languages.

Keywords