Augmentation Techniques for Adult-Speech to Generate Child-Like Speech Data Samples at Scale

Mariam Yahayah Yiwere; Andrei Barcovschi; Rishabh Jain; Horia Cucu; Peter Corcoran

doi:10.1109/ACCESS.2023.3317360

IEEE Access (Jan 2023)

Affiliations

Mariam Yahayah Yiwere: School of Electrical and Electronics Engineering, University of Galway, Galway, Ireland
Andrei Barcovschi: School of Electrical and Electronics Engineering, University of Galway, Galway, Ireland
Rishabh Jain: School of Electrical and Electronics Engineering, University of Galway, Galway, Ireland
Horia Cucu: Speech and Dialogue Research Laboratory, University Politehnica of Bucharest, Bucharest, Romania
Peter Corcoran: School of Electrical and Electronics Engineering, University of Galway, Galway, Ireland

DOI: https://doi.org/10.1109/ACCESS.2023.3317360
Journal volume & issue: Vol. 11
pp. 109066 – 109081

Abstract

Technologies such as Text-To-Speech (TTS) synthesis and Automatic Speech Recognition (ASR) have become important in providing speech-based Artificial Intelligence (AI) solutions in today’s AI-centric technology sector. Most current research and solutions focus on adult speech rather than child speech. This disparity is largely due to the limited availability of children’s speech datasets that can be used to train modern speech AI systems. In this paper, we propose and validate a speech augmentation pipeline that transforms existing adult speech datasets into synthetic child-like speech. We use a publicly available phase vocoder-based toolbox for manipulating sound files to tune the pitch and duration of adult speech utterances, making them sound child-like. Both objective and subjective evaluations are performed on the resulting synthetic child utterances. For the objective evaluation, the speaker embeddings of the selected top adult speakers are compared with a mean child speaker embedding before and after augmentation. The average adult voice is shown to have a cosine similarity of approximately 0.87 (87%) to the mean child voice after augmentation, compared to approximately 0.74 (74%) before augmentation. Mean Opinion Score (MOS) tests were also conducted for the subjective evaluation, yielding average scores of 3.7 for how convincing the samples are as child speech and 4.6 for how intelligible the speech is. Finally, ASR models fine-tuned with the augmented speech are tested against a baseline set of ASR experiments, showing modest improvements over the baseline model fine-tuned with only adult speech.
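The objective evaluation described in the abstract compares each adult speaker embedding with a mean child speaker embedding using cosine similarity. The sketch below illustrates that comparison only. It is not the authors' code: the function names, the 3-dimensional toy vectors, and the choice to average child embeddings element-wise are assumptions made for illustration, and real speaker embeddings would come from a speaker encoder that the abstract does not specify.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


# Hypothetical toy embeddings (assumed values, not data from the paper).
child_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
adult_before = [0.2, 0.9, 0.1]  # adult speaker before augmentation
adult_after = [0.7, 0.3, 0.1]   # same speaker after pitch/duration tuning

# Reference point: the mean child speaker embedding.
mean_child = mean_embedding(child_embeddings)

# Similarity to the mean child voice, before and after augmentation.
sim_before = cosine_similarity(adult_before, mean_child)
sim_after = cosine_similarity(adult_after, mean_child)

print(f"before augmentation: {sim_before:.2f}")
print(f"after augmentation:  {sim_after:.2f}")
```

With these toy values the similarity rises after augmentation, which mirrors the direction of the reported change (approximately 0.74 to approximately 0.87) but not its magnitude.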

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
