Applied Sciences (Jan 2023)

HierTTS: Expressive End-to-End Text-to-Waveform Using a Multi-Scale Hierarchical Variational Auto-Encoder

  • Zengqiang Shang
  • Peiyang Shi
  • Pengyuan Zhang
  • Li Wang
  • Guangying Zhao

DOI
https://doi.org/10.3390/app13020868
Journal volume & issue
Vol. 13, no. 2
p. 868

Abstract

End-to-end text-to-speech (TTS) models that directly generate waveforms from text are gaining popularity. However, existing end-to-end models are still not natural enough in their prosodic expressiveness. Additionally, previous studies on improving the expressiveness of TTS have mainly focused on acoustic models, and there is a lack of research on enhancing expressiveness within an end-to-end framework. Therefore, we propose HierTTS, a highly expressive end-to-end text-to-waveform generation model. It deeply couples the hierarchical properties of speech with hierarchical variational auto-encoders and models multi-scale latent variables at the frame, phone, subword, word, and sentence levels. The hierarchical encoder encodes the speech signal from fine-grained features into coarse-grained latent variables, while the hierarchical decoder generates fine-grained features conditioned on the coarse-grained latent variables. We propose a staged KL-weight annealing strategy to prevent hierarchical posterior collapse. Furthermore, we employ a hierarchical text encoder to extract linguistic information at different levels, which conditions both the encoder and the decoder. Experiments show that our model is closer to natural speech in prosodic expressiveness and has better generative diversity.
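To make the staged KL-weight annealing idea concrete, the sketch below shows one plausible schedule for a five-level hierarchy (frame, phone, subword, word, sentence), where each level's KL weight warms up in its own stage rather than all at once. The stage ordering, stage length, and maximum weight are illustrative assumptions and not taken from the paper; the function and variable names (staged_kl_weights, weighted_kl_loss, stage_len) are hypothetical.

```python
# Hypothetical sketch of a staged KL-weight annealing schedule for a
# five-level hierarchical VAE (frame, phone, subword, word, sentence).
# Stage length, ordering, and maximum weights are assumptions for
# illustration only, not the schedule used in HierTTS.

from typing import Dict, List

LEVELS: List[str] = ["frame", "phone", "subword", "word", "sentence"]

def staged_kl_weights(
    step: int,
    stage_len: int = 10_000,   # assumed number of training steps per stage
    max_weight: float = 1.0,   # assumed final KL weight for every level
) -> Dict[str, float]:
    """Return one KL weight per level, annealed one level at a time.

    Level i stays at 0 until stage i begins, then warms up linearly to
    max_weight over one stage. Annealing the levels sequentially (rather
    than jointly) is one way to discourage hierarchical posterior collapse.
    """
    weights: Dict[str, float] = {}
    for i, level in enumerate(LEVELS):
        start = i * stage_len
        progress = (step - start) / stage_len
        weights[level] = max(0.0, min(1.0, progress)) * max_weight
    return weights

def weighted_kl_loss(kl_terms: Dict[str, float], step: int) -> float:
    """Combine per-level KL divergences using the staged schedule."""
    w = staged_kl_weights(step)
    return sum(w[level] * kl for level, kl in kl_terms.items())

if __name__ == "__main__":
    # Example: mid-way through training, only the finer levels are active.
    dummy_kl = {lvl: 1.0 for lvl in LEVELS}  # placeholder per-level KL values
    print(staged_kl_weights(step=25_000))
    print(weighted_kl_loss(dummy_kl, step=25_000))
```

A sequential warm-up like this gives the fine-grained latents time to become informative before the coarser levels are penalized, which is the general motivation behind staged annealing; the paper's actual schedule should be taken from the published text.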

Keywords