Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

Yi Zhao; Shinji Takaki; Hieu-Thi Luong; Junichi Yamagishi; Daisuke Saito; Nobuaki Minematsu

doi:10.1109/ACCESS.2018.2872060

IEEE Access (Jan 2018)


Affiliations

Yi Zhao: Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Shinji Takaki: Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Hieu-Thi Luong: Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Junichi Yamagishi: Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Daisuke Saito: Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Nobuaki Minematsu: Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan

DOI: https://doi.org/10.1109/ACCESS.2018.2872060
Journal volume & issue: Vol. 6
pp. 60478 – 60488

Abstract

WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and has achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acoustic features as local condition parameters, and these parameters must be accurately predicted by a separate acoustic model. It is not yet clear how to train this acoustic model, which is problematic because the final quality of the synthetic speech is significantly affected by the acoustic model's performance. Significant degradation occurs especially when the predicted acoustic features have characteristics that do not match those of natural ones. To reduce this mismatch between natural and generated acoustic features, we propose new frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. The GAN generator serves as the acoustic model, and its outputs are used as the local condition parameters of the WaveNet. We also extend the GAN frameworks and use the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as part of the objective functions. Experimental results show that acoustic models trained using the WGAN-GP framework with back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
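
The loss terms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy critic `tanh(x . w)`, the penalty coefficient `lam`, the weights `w_adv` and `w_dml`, and the treatment of the DML loss as a precomputed scalar (`dml_nll`) are all assumptions made here for illustration. In a real system the critic would be a neural network, the gradients would come from automatic differentiation, and the DML term would be back-propagated through a trained WaveNet.

```python
import numpy as np

def critic(x, w):
    """Toy critic D(x) = tanh(x . w); a stand-in for a neural discriminator."""
    return np.tanh(x @ w)

def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to each input row."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def wgan_gp_critic_loss(real, fake, w, lam=10.0, rng=None):
    """Wasserstein critic loss plus a gradient penalty on random interpolates.

    lam is an assumed penalty coefficient, not a value taken from the abstract.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Sample points on straight lines between real and generated features.
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1.0 - eps) * fake
    # Penalize deviation of the critic's gradient norm from 1.
    grad_norm = np.linalg.norm(critic_grad(interp, w), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    wasserstein = np.mean(critic(fake, w)) - np.mean(critic(real, w))
    return wasserstein + lam * penalty

def generator_loss(pred, target, fake_score, dml_nll, w_adv=1.0, w_dml=1.0):
    """Weighted sum of MSE, adversarial, and DML terms (weights are hypothetical)."""
    mse = np.mean((pred - target) ** 2)
    adv = -np.mean(fake_score)  # the generator tries to raise the critic's score
    return mse + w_adv * adv + w_dml * dml_nll

# Usage with random stand-ins for natural and predicted acoustic features.
rng = np.random.default_rng(1)
real = rng.normal(size=(8, 4))
fake = rng.normal(size=(8, 4))
w = rng.normal(size=4)
d_loss = wgan_gp_critic_loss(real, fake, w)
g_loss = generator_loss(fake, real, critic(fake, w), dml_nll=2.0)
```

The gradient penalty keeps the critic approximately 1-Lipschitz, which is what distinguishes WGAN-GP from the standard GAN objective; how the three generator terms are actually weighted in the paper is not stated in the abstract.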

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
