Multi Speaker Natural Speech Synthesis Using Generative Flows

Dmitry Obukhov

doi:10.25559/SITITO.17.202104.896-905

Современные информационные технологии и IT-образование (Dec 2021)

Affiliations

Dmitry Obukhov: Novosibirsk State Technical University; Dasha.AI, Novosibirsk, Russia

Journal volume & issue: Vol. 17, no. 4
pp. 896 – 905

Abstract

Modern speech synthesis systems generate natural-sounding speech and achieve high performance. Models based on generative flows, among others, have shown impressive results and can produce varied pronunciations of the same text. However, they are designed to synthesize the voice of a single speaker. Although techniques for training on several speakers have recently been proposed, the quality of multi-speaker synthesis still leaves much to be desired. This paper proposes techniques to improve the quality of multi-speaker synthesis with acoustic models based on generative flows. The first technique is to obtain the time alignment between the speech audio signal and the text sequence from an external system. Such forced alignments indicate which sound was uttered at which point in time. The parallel speech synthesis system considered here needs this information because it resolves the mismatch between the lengths of the input and output sequences. An external alignment system is more accurate than the internal heuristics used during training, since it can be trained on a larger amount of data and therefore generalizes better. The second technique is to use speaker embeddings: real-valued vectors of fixed dimension that carry information about the speaker and are obtained from an external system. This paper considers speaker embeddings obtained from a speaker verification system. These representations have the property that embeddings computed from speech fragments of the same speaker lie close together in the embedding space, while embeddings computed from fragments of different speakers lie far apart. With such speaker representations, the synthesis system produces speech in the voices of different speakers more accurately.
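
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the 10 ms frame hop, the alignment intervals, and the two-dimensional "embeddings" are all hypothetical toy values chosen for illustration. The first part turns forced-alignment intervals into per-symbol frame counts and repeats each symbol accordingly, which is one common way to make a text sequence as long as the acoustic frame sequence. The second part uses cosine similarity as an assumed measure of closeness between speaker embeddings.

```python
import math


def durations_from_alignment(intervals, hop_seconds):
    """Convert (symbol, start_sec, end_sec) intervals into frame counts.

    Boundaries are snapped to the frame grid first, so the counts of
    contiguous intervals add up to the total number of frames.
    """
    durations = []
    for symbol, start, end in intervals:
        first_frame = round(start / hop_seconds)
        last_frame = round(end / hop_seconds)
        durations.append((symbol, last_frame - first_frame))
    return durations


def expand_to_frames(durations):
    """Repeat each symbol for its number of frames."""
    frames = []
    for symbol, count in durations:
        frames.extend([symbol] * count)
    return frames


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical output of an external aligner: three sounds over 0.2 s.
alignment = [("h", 0.00, 0.05), ("e", 0.05, 0.15), ("l", 0.15, 0.20)]
durations = durations_from_alignment(alignment, hop_seconds=0.01)
frames = expand_to_frames(durations)

# Toy 2-D vectors standing in for real speaker embeddings.
speaker_a_clip1 = [1.0, 0.1]
speaker_a_clip2 = [0.9, 0.2]
speaker_b_clip1 = [-0.1, 1.0]
same_speaker = cosine_similarity(speaker_a_clip1, speaker_a_clip2)
different_speaker = cosine_similarity(speaker_a_clip1, speaker_b_clip1)
```

In this toy setup the three text symbols expand to a frame-level sequence whose length matches the acoustic frames, and the two clips of the same speaker score higher than clips of different speakers. Real embeddings from a verification system have far more dimensions, and the distance measure actually used in the paper is not stated in the abstract.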

Published in Современные информационные технологии и IT-образование

ISSN: 2411-1473 (Print)
Publisher: The Fund for Promotion of Internet media, IT education, human development «League Internet Media»
Country of publisher: Russian Federation
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://sitito.cs.msu.ru
