NCHLT Auxiliary speech data for ASR technology development in South Africa

Jaco Badenhorst; Febe de Wet

Data in Brief (Apr 2022)

Affiliations

Jaco Badenhorst: Corresponding author; Voice Computing Research Group, CSIR Next Generation Enterprises and Institutions Cluster, P.O. Box 395, Pretoria 0001, South Africa
Febe de Wet: Voice Computing Research Group, CSIR Next Generation Enterprises and Institutions Cluster, P.O. Box 395, Pretoria 0001, South Africa; Department of Electrical and Electronic Engineering, Stellenbosch University, Private Bag X1, Stellenbosch 7602, South Africa

Journal volume & issue: Vol. 41, p. 107860

Abstract

The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all the recordings made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language, as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
