IEEE Open Journal of Signal Processing (Jan 2024)
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
Abstract
Diffusion models can generate high-quality data through a probabilistic approach, but they suffer from slow generation because sampling requires many time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on sample generation without explicitly modeling the entire probability distribution, while models such as denoising diffusion generative adversarial networks (GANs) combine the diffusion process with GANs. In the field of speech synthesis, DiffGAN-TTS, a recent diffusion-based speech synthesis model that adopts a GAN structure, has demonstrated strong performance in both speech quality and generation speed. In this paper, to further improve on DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process, and a spectrogram discriminator that learns the distribution of the generated data. We evaluate the proposed model with objective metrics, namely the structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0-RMSE), phoneme error rate (PER), and word error rate (WER), as well as the subjective mean opinion score (MOS). The results show that our model matches or exceeds recent state-of-the-art models such as FastSpeech 2 and DiffGAN-TTS across these metrics. Our code and audio samples are available on GitHub.
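To make the dual-discriminator idea concrete, the sketch below shows one plausible form of the adversarial objective described in the abstract: a diffusion discriminator that judges a denoising step conditioned on the noisier sample, and a spectrogram discriminator that judges the clean mel-spectrogram. All module names, layer choices, and the least-squares GAN formulation are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a dual-discriminator setup for a diffusion TTS model.
# Shapes assume mel-spectrograms of shape (batch, n_mels, frames).


class DiffusionDiscriminator(nn.Module):
    """Judges whether a denoising step x_{t-1}, conditioned on x_t, is real."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * n_mels, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, x_t: torch.Tensor, x_tm1: torch.Tensor) -> torch.Tensor:
        # Condition on the noisier sample x_t via channel concatenation.
        return self.net(torch.cat([x_t, x_tm1], dim=1))


class SpectrogramDiscriminator(nn.Module):
    """Judges whether a clean mel-spectrogram x_0 is real or generated."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        return self.net(x0)


def generator_adv_loss(
    d_diff: DiffusionDiscriminator,
    d_spec: SpectrogramDiscriminator,
    x_t: torch.Tensor,        # noisy sample at step t
    x_tm1_fake: torch.Tensor, # generator's predicted denoising step
    x0_fake: torch.Tensor,    # generator's predicted clean spectrogram
    lambda_spec: float = 1.0, # assumed weight balancing the two terms
) -> torch.Tensor:
    """Least-squares GAN loss for the generator against both discriminators."""
    loss_diff = ((d_diff(x_t, x_tm1_fake) - 1.0) ** 2).mean()
    loss_spec = ((d_spec(x0_fake) - 1.0) ** 2).mean()
    return loss_diff + lambda_spec * loss_spec
```

Under this reading, the diffusion discriminator supplies a per-step adversarial signal on the reverse process, while the spectrogram discriminator adds a global signal on the final output; the paper's actual losses and architectures may differ.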
Keywords