Data in Brief (Dec 2024)
Machine translation training data for English–Tshivenḓa
Abstract
This data article describes a machine translation training data set for translation between English and Tshivenḓa. The data set contains parallel, aligned English–Tshivenḓa data as well as monolingual Tshivenḓa data. The data was collected both by crawling multilingual South African government websites and from matched documents obtained from translators or publishing sources. Additional unique data was translated from English into Tshivenḓa by professional translators to increase the total corpus size. This article describes how the data was collected and translated, and how alignment and corpus cleanup were performed. Word counts for the corpus are also given. In addition to training machine translation systems, this data can be used to develop other Tshivenḓa core technologies and for linguistic studies.