Jisuanji kexue (Aug 2021)

Overview of Speech Synthesis and Voice Conversion Technology Based on Deep Learning

  • PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin

DOI
https://doi.org/10.11896/jsjkx.200500148
Journal volume & issue
Vol. 48, no. 8
pp. 200 – 208

Abstract

Driven by deep learning, voice information processing technology is developing rapidly. Combining speech synthesis with voice conversion makes it possible to produce real-time, high-fidelity speech for designated speakers and content, with broad application prospects in human-computer interaction, pan-entertainment, and other fields. This paper provides an overview of speech synthesis and voice conversion technology based on deep learning. First, it briefly reviews the development of both technologies. Next, it lists the common public datasets in these fields to help researchers carry out related work. It then discusses text-to-speech (TTS) models, covering classic and cutting-edge models and algorithms in terms of style, rhythm, and speed, and compares their performance and development potential. It then reviews voice conversion by summarizing conversion methods and optimization methods. Finally, it summarizes the applications and challenges of speech synthesis and voice conversion and, based on the problems they face in terms of models, applications, and regulation, looks ahead to future directions in model compression, few-shot learning, and forgery detection.

Keywords