Applied Sciences (Sep 2019)

A Review of Deep Learning Based Speech Synthesis

  • Yishuang Ning,
  • Sheng He,
  • Zhiyong Wu,
  • Chunxiao Xing,
  • Liang-Jie Zhang

DOI
https://doi.org/10.3390/app9194050
Journal volume & issue
Vol. 9, no. 19
p. 4050

Abstract

Read online

Speech synthesis, also known as text-to-speech (TTS), has attracted increasingly more attention. Recent advances on speech synthesis are overwhelmingly contributed by deep learning or even end-to-end techniques which have been utilized to enhance a wide range of application scenarios such as intelligent speech interaction, chatbot or conversational artificial intelligence (AI). For speech synthesis, deep learning based techniques can leverage a large scale of <text, speech> pairs to learn effective feature representations to bridge the gap between text and speech, thus better characterizing the properties of events. To better understand the research dynamics in the speech synthesis field, this paper firstly introduces the traditional speech synthesis methods and highlights the importance of the acoustic modeling from the composition of the statistical parametric speech synthesis (SPSS) system. It then gives an overview of the advances on deep learning based speech synthesis, including the end-to-end approaches which have achieved start-of-the-art performance in recent years. Finally, it discusses the problems of the deep learning methods for speech synthesis, and also points out some appealing research directions that can bring the speech synthesis research into a new frontier.

Keywords