Data in Brief (Jun 2024)

Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati

  • Tanja Gaustad,
  • Cindy A. McKellar,
  • Martin J. Puttkammer

Journal volume & issue
Vol. 54
p. 110325

Abstract

This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and, together with English, an official language of Eswatini. The dataset contains parallel textual data for English and Siswati as well as monolingual data for Siswati, and was developed as training data for machine translation systems, specifically for the Autshumato machine translation project. Both corpora can also be used to develop and evaluate core Natural Language Processing (NLP) technologies for Siswati. In addition, the data lends itself to corpus linguistic studies. The article describes how the data was collected, what types of text it contains, and what clean-up was performed. It also provides an overview of the number of words contained in the datasets.

Keywords