Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati

Tanja Gaustad; Martin J. Puttkammer

Data in Brief (Apr 2022)

Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati

Tanja Gaustad,
Martin J. Puttkammer

Affiliations

Tanja Gaustad: Corresponding author.; Centre for Text Technology, North-West University, South Africa
Martin J. Puttkammer: Centre for Text Technology, North-West University, South Africa

Journal volume & issue: Vol. 41
p. 107994

Abstract

Read online

This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies. The article describes how the data was collected, what type of texts it contains and it provides some details on the three different types of linguistic annotation added (morphology, part-of-speech and lemmas), including an example.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/

About the journal

Abstract

Keywords