Data in Brief (Feb 2024)
Speech recognition datasets for low-resource Congolese languages
Abstract
Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages owing to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess sufficient resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four major low-resource languages of the region. During data collection, Lingala Read Speech recordings were made by thirty-two distinct adult speakers, each in a unique context, under various settings and with different accents. Concurrently, the raw Congolese Speech Radio data were taken from the archive of a broadcast station and then passed through a dedicated curation process. During data preparation, several strategies were applied to pre-process the data. The datasets, which have been made freely accessible to all researchers, serve as a valuable resource not only for investigating and developing monolingual methods and approaches that employ linguistically distant languages, but also for multilingual approaches with linguistically similar languages. Using techniques such as supervised learning and self-supervised learning, researchers can develop the first speech recognition benchmarks for Lingala and the first multilingual model tailored to four Congolese languages spoken by an aggregated population of 95 million. Moreover, two models were applied to these datasets: the first for supervised learning and the second for self-supervised pre-training.
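
The combination of self-supervised pre-training with supervised fine-tuning described above can be illustrated with a minimal sketch. The sketch below is not the authors' pipeline: it assumes a publicly available multilingual wav2vec 2.0 checkpoint (facebook/wav2vec2-xls-r-300m) and a hypothetical file path into the Lingala Read Speech Corpus, and only shows how frame-level self-supervised representations could be extracted for later fine-tuning of a CTC head on the 4 h of labelled Lingala audio.

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical path to one clip from the Lingala Read Speech Corpus.
AUDIO_PATH = "lingala_read_speech/speaker_01/utt_0001.wav"

# A multilingual self-supervised checkpoint (XLS-R); the checkpoint used
# in the article may differ.
CHECKPOINT = "facebook/wav2vec2-xls-r-300m"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

waveform, sample_rate = torchaudio.load(AUDIO_PATH)
# wav2vec 2.0 expects 16 kHz mono audio.
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
waveform = waveform.mean(dim=0)  # collapse channels to mono

inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations on which a downstream CTC head could be
# fine-tuned with the labelled Lingala data.
print(outputs.last_hidden_state.shape)  # (1, n_frames, hidden_size)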