Jisuanji kexue (Apr 2022)
End-to-End Speech Synthesis Based on BERT
Abstract
RNN-based neural speech synthesis models are inefficient to train and to run at inference, and they lose information over long distances. To address these problems, this paper proposes an end-to-end speech synthesis method based on BERT, which replaces the RNN encoder in the Seq2Seq speech synthesis architecture with a self-attention mechanism. The method uses a pre-trained BERT as the encoder to extract contextual information from the input text; the decoder, which adopts the same architecture as the Tacotron2 speech synthesis model, outputs the Mel spectrogram; and a trained WaveGlow network then converts the Mel spectrogram into the final audio. Because the downstream task is fine-tuned from a pre-trained BERT, the method significantly reduces the number of parameters to be trained and the training time. In addition, self-attention allows the encoder's hidden states to be computed in parallel, which makes full use of the GPU's parallel computing power, improves training efficiency, and effectively alleviates the long-range dependency problem. Comparison experiments with Tacotron2 show that the proposed model doubles the training speed while obtaining results similar to those of Tacotron2.
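The contrast between a recurrent encoder and a self-attention encoder can be illustrated with a minimal NumPy sketch. It is not the model described here: the single-head attention, the random weights, the toy dimensions, and the function names `self_attention` and `rnn_encode` are all assumptions chosen for illustration. The sketch only shows that self-attention produces every hidden state from a few matrix products, while a recurrent encoder must step through the sequence one position at a time.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    All positions are processed together: no step depends on the
    hidden state of the previous position.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

def rnn_encode(x, w_x, w_h):
    """Plain tanh RNN: each state depends on the previous one,
    so the loop over time steps cannot be parallelized."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

# Toy input: 6 token embeddings of width 8 (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

hidden, weights = self_attention(x, w_q, w_k, w_v)
rnn_states = rnn_encode(
    x,
    rng.normal(size=(d_model, d_model)),
    rng.normal(size=(d_model, d_model)),
)
```

In the attention matrix `weights`, the entry at row i and column j connects position i directly to position j regardless of how far apart they are, which is the property that helps with long-range dependencies. In the recurrent version, information from early positions reaches later ones only through the repeatedly updated state `h`.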
Keywords