IEEE Access (Jan 2022)
Voice Conversion Based Augmentation and a Hybrid CNN-LSTM Model for Improving Speaker-Independent Keyword Recognition on Limited Datasets
Abstract
Keyword recognition is the basis of speech recognition, and its applications are rapidly increasing in keyword spotting, robotics, and smart home surveillance. Because of these advanced applications, improving the accuracy of keyword recognition is crucial. In this paper, we propose voice conversion (VC)-based augmentation to enlarge a limited training dataset, together with a hybrid convolutional neural network (CNN) and long short-term memory (LSTM) model, for robust speaker-independent isolated keyword recognition. Collecting and preparing a sufficient amount of voice data for speaker-independent speech recognition is a tedious and laborious task. To overcome this, we generated new raw voices from the original voices using an auxiliary classifier variational autoencoder (ACVAE). In this study, the main aim of voice conversion is to obtain numerous and diverse human-like keyword utterances whose pronunciation is not identical to that of the source and target speakers. Parallel VC was used to accurately preserve the linguistic content. We examined the performance of the proposed voice conversion augmentation technique using robust deep neural network algorithms. The baseline consisted of the original training data with conventional data augmentation and regularization techniques, excluding the VC-generated voices. The results showed that incorporating voice conversion augmentation into the baseline augmentation techniques and applying the CNN-LSTM model improved the accuracy of isolated keyword recognition.
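To make the hybrid architecture concrete, the forward pass of a CNN-LSTM keyword classifier can be sketched as below. This is an illustrative sketch only: the dimensions (40 features, 50 frames, 32 filters, 64 LSTM units, 10 classes) and the random weights are assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(x, W, b):
    """Valid 1-D convolution over time. x: (T, C_in), W: (k, C_in, C_out)."""
    k = W.shape[0]
    T_out = x.shape[0] - k + 1
    return np.stack([np.tensordot(x[t:t + k], W, axes=([0, 1], [0, 1])) + b
                     for t in range(T_out)])

def lstm_last_hidden(x, Wx, Wh, b):
    """Run a single LSTM over x: (T, D); return the final hidden state."""
    H = Wh.shape[0]
    h = np.zeros(H)
    c = np.zeros(H)
    for t in range(x.shape[0]):
        z = x[t] @ Wx + h @ Wh + b           # (4H,) gate pre-activations
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell state update
        h = o * np.tanh(c)                   # hidden state
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed toy dimensions: 40 mel-like features over 50 frames,
# 32 conv filters of width 5, 64 LSTM units, 10 keyword classes.
T, F, K, C, H, N_CLASSES = 50, 40, 5, 32, 64, 10
features = rng.standard_normal((T, F))       # stand-in for a log-mel spectrogram

W_conv = rng.standard_normal((K, F, C)) * 0.1
b_conv = np.zeros(C)
Wx = rng.standard_normal((C, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b_lstm = np.zeros(4 * H)
W_out = rng.standard_normal((H, N_CLASSES)) * 0.1
b_out = np.zeros(N_CLASSES)

# The CNN front end extracts local spectral patterns; the LSTM models
# their temporal order; a softmax layer scores the keyword classes.
conv_out = relu(conv1d(features, W_conv, b_conv))   # (T-K+1, C)
h_final = lstm_last_hidden(conv_out, Wx, Wh, b_lstm)
probs = softmax(h_final @ W_out + b_out)
print(probs.shape)
```

The same sequential composition (convolutional feature extraction followed by recurrent temporal modelling and a classification head) is what deep-learning frameworks implement with their built-in layers.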
Keywords