IEEE Access (Jan 2021)

Many-to-Many Unsupervised Speech Conversion From Nonparallel Corpora

  • Yun Kyung Lee,
  • Hyun Woo Kim,
  • Jeon Gue Park

DOI
https://doi.org/10.1109/ACCESS.2021.3058382
Journal volume & issue
Vol. 9
pp. 27278 – 27286

Abstract


We propose a nonparallel data-driven method for many-to-many speech modeling and multimodal style conversion. Rather than training a conversion model for a specific source and target domain pair, we train a single model over multiple domains and generate diverse output speech signals from a given source-domain utterance by transferring speech style-related characteristics while preserving the linguistic content. The proposed method comprises a variational autoencoder (VAE)-based many-to-many speech conversion network with a Wasserstein generative adversarial network (WGAN) and a skip-connected autoencoder-based self-supervised learning network. The conversion network decomposes the spectral features of the input speech into a content factor that represents domain-invariant information and a style factor that represents domain-related information, automatically estimating the various speech styles of each domain; it then converts the input speech to another domain by combining the extracted content factor with the desired target style factor. Diverse, multimodal outputs can be generated by sampling different style factors. Model training is stabilized and the quality of the generated outputs is improved by sharing the discriminator between the VAE-based speech conversion network and the self-supervised learning network. We apply the proposed method to speaker conversion and conduct perceptual evaluations. Experimental results show that the proposed method achieves high spectral conversion accuracy, significantly improves the sound quality and speaker similarity of the converted speech, and contributes to stable model training.
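
The following is a minimal sketch, not the authors' implementation, of the content/style decomposition described in the abstract: an encoder maps input spectral features to a domain-invariant content factor and a domain-related style factor, and a decoder produces converted spectra by pairing the source content factor with a sampled target style factor. All module names, dimensions, and the use of PyTorch are assumptions for illustration; the WGAN discriminator and the self-supervised network with the shared discriminator are omitted.

```python
# Hypothetical sketch of the content/style VAE described in the abstract.
# Not the authors' code; names, sizes, and framework (PyTorch) are assumed.
import torch
import torch.nn as nn


class ContentStyleVAE(nn.Module):
    def __init__(self, feat_dim=80, content_dim=64, style_dim=16):
        super().__init__()
        # Shared encoder trunk over per-frame spectral features.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Separate heads for the content factor and the style factor
        # (mean and log-variance, as in a standard VAE).
        self.content_mu = nn.Linear(256, content_dim)
        self.content_logvar = nn.Linear(256, content_dim)
        self.style_mu = nn.Linear(256, style_dim)
        self.style_logvar = nn.Linear(256, style_dim)
        # Decoder consumes a (content, style) pair and regenerates spectra.
        self.decoder = nn.Sequential(
            nn.Linear(content_dim + style_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def encode(self, x):
        h = self.encoder(x)
        content = self.reparameterize(self.content_mu(h), self.content_logvar(h))
        style = self.reparameterize(self.style_mu(h), self.style_logvar(h))
        return content, style

    def convert(self, x_src, style_tgt):
        # Keep the source content factor, swap in the target style factor.
        content_src, _ = self.encode(x_src)
        return self.decoder(torch.cat([content_src, style_tgt], dim=-1))


if __name__ == "__main__":
    model = ContentStyleVAE()
    src_frames = torch.randn(100, 80)      # source-domain spectral frames
    sampled_style = torch.randn(100, 16)   # sampled target style factor
    converted = model.convert(src_frames, sampled_style)
    print(converted.shape)                 # torch.Size([100, 80])
```

Sampling different `sampled_style` vectors for the same source frames yields the diverse, multimodal outputs the abstract refers to; in the paper's full setup, the decoder outputs would additionally be shaped by the WGAN discriminator that is shared with the self-supervised learning network.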

Keywords