Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

Mohammed Salah Al-Radhi; Tamás Gábor Csapó; Géza Németh

doi:10.3390/app11167489

Applied Sciences (Aug 2021)

Affiliations

Mohammed Salah Al-Radhi: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary
Tamás Gábor Csapó: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary
Géza Németh: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary

DOI: https://doi.org/10.3390/app11167489
Journal volume & issue: Vol. 11, no. 16, p. 7489

Abstract

Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly learn a mapping between given source–target speaker pairs, using data that contain pairs of similar utterances spoken by different speakers. However, parallel data are computationally expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows non-parallel many-to-many voice conversion by using a generative adversarial network. To the best of the authors’ knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate converted speech signals. Our method requires only several minutes of training examples, without parallel utterances or time alignment procedures, and the source–target speakers are entirely unseen in the training dataset. An empirical study is carried out on the publicly available CSTR VCTK corpus. Our conclusions indicate that the proposed method reaches state-of-the-art results in speaker similarity to the utterance produced by the target speaker, while suggesting important structural ones to be further analyzed by experts.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
