Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Susmitha Vekkot; Deepa Gupta; Mohammed Zakariah; Yousef Ajami Alotaibi

doi:10.1109/ACCESS.2020.2988781

IEEE Access (Jan 2020)

Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Susmitha Vekkot,
Deepa Gupta,
Mohammed Zakariah,
Yousef Ajami Alotaibi

Affiliations

Susmitha Vekkot: ORCiD; Department of Electronics & Communication Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India
Deepa Gupta: ORCiD; Department of Computer Science & Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India
Mohammed Zakariah: Research Center, College of Computer and Information Science, King Saud University, Riyadh, Saudi Arabia
Yousef Ajami Alotaibi: Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2020.2988781
Journal volume & issue: Vol. 8
pp. 74627 – 74647

Abstract

Read online

We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, tested on three different datasets (languages), namely, EmoDB (German), IITKGP (Telugu), and SAVEE (English). The optimized learning model introduced is unique because of its ability to synthesize emotional speech with an acceptable perceptive quality while preserving speaker characteristics. The multilingual model is extremely beneficial in scenarios wherein emotional training data from a specific target speaker are sparsely available. The proposed model uses speaker-normalized mel-generalized cepstral coefficients for spectral training with data adaptation using the seed data from the target speaker. The fundamental frequency (F0) is transformed using a wavelet synchrosqueezed transform prior to mapping to obtain a sharpened time-frequency representation. Moreover, a feedforward artificial neural network, together with particle swarm optimization, was used for F0 training. Additionally, static-intensity modification was also performed for each test utterance. Using the framework, we were able to capture the spectral and pitch contour variabilities of emotional expression better than with other state-of-the-art methods used in this study. Considering the overall performance scores across datasets, an average melcepstral distortion (MCD) of 4.98 and root mean square error (RMSE-F0) of 10.67 were obtained in objective evaluations, and an average comparative mean opinion score (CMOS) of 3.57 and speaker similarity score of 3.70 were obtained for the proposed framework. Particularly, the best MCD of 4.09 (EmoDB-happiness) and RMSE-F0 of 9.00 (EmoDB-anger) were obtained, along with the maximum CMOS of 3.7 and speaker similarity of 4.6, thereby highlighting the effectiveness of the hybrid network model.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639

About the journal

Abstract

Keywords