IEEE Access (Jan 2020)

Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

  • Kurniawati Azizah,
  • Mirna Adriani,
  • Wisnu Jatmiko

DOI
https://doi.org/10.1109/ACCESS.2020.3027619
Journal volume & issue
Vol. 8
pp. 179798 – 179812

Abstract

Read online

This work applies a hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based system typically requires a large amount of training data. In recent years, while DNN-based TTS has made remarkable results for high-resource languages, it still suffers from a data scarcity problem for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A pre-trained monolingual TTS on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS and finally from the pre-trained multilingual TTS to a multilingual with style transfer TTS. Our experiment on Indonesian, Javanese, and Sundanese languages show adequate quality of synthesized speech. The evaluation of our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36). Whereas for Javanese and Sundanese it reaches a MOS of 4.20 (ground truth = 4.38) and 4.28 (ground truth = 4.20), respectively. For parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be effectively applied to the low-resource languages target domain. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human voice, and successfully transfer speaking style from a reference audio.

Keywords