Natural Language Processing Journal (Sep 2024)
Kurdish end-to-end speech synthesis using deep neural networks
Abstract
This article introduces an end-to-end text-to-speech (TTS) system for Central Kurdish (CK, also known as Sorani), a low-resourced language, and tackles the challenges associated with limited data availability. We compiled a dataset suitable for end-to-end TTS that includes 21 hours of CK speech in a female voice paired with the corresponding texts. To identify the best-performing system, we ran three training experiments with Tacotron2, an end-to-end deep neural network for speech synthesis: one model trained starting from a pre-trained English system, and two models trained from scratch, one on the full dataset and one on an intonationally balanced dataset. We evaluated these models using the Mean Opinion Score (MOS), a subjective evaluation metric. Our findings show that the model trained from scratch on the full CK dataset surpasses both the model trained on the intonationally balanced dataset and the model trained using the pre-trained English model in naturalness and intelligibility, achieving a MOS of 4.78 out of 5.
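As a rough illustration of the evaluation metric, a MOS is commonly computed as the arithmetic mean of listener ratings on a 1 to 5 scale, often reported with a confidence interval. The sketch below assumes that convention; the function name `mean_opinion_score`, the normal-approximation interval, and the sample ratings are all hypothetical and are not taken from the study.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    Assumption: the interval uses a normal approximation (z = 1.96),
    which is a common but not universal choice for MOS reporting.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")

    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        # A single rating gives no spread to estimate an interval from.
        return mos, 0.0

    # Sample standard deviation divided by sqrt(n) is the standard error.
    std_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_error


# Hypothetical ratings for one synthesized utterance (not data from the study).
ratings = [5, 5, 4, 5, 5, 4, 5, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, ratings would be pooled over many utterances and listeners per model, and the per-model scores compared on the same test sentences.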