A multi-task learning speech synthesis optimization method based on CWT: a case study of Tacotron2

Guoqiang Hu; Zhuofan Ruan; Wenqiu Guo; Yujuan Quan

doi:10.1186/s13634-023-01096-x

EURASIP Journal on Advances in Signal Processing (Jan 2024)

Affiliations

Guoqiang Hu: International School, Jinan University
Zhuofan Ruan: Information Hub, The Hong Kong University of Science and Technology (Guangzhou)
Wenqiu Guo: School of Business, Macau University of Science and Technology
Yujuan Quan: College of Information Science and Technology, Jinan University

DOI: https://doi.org/10.1186/s13634-023-01096-x
Journal volume & issue: Vol. 2024, no. 1
pp. 1 – 14

Abstract

Text-to-speech synthesis plays an essential role in facilitating human-computer interaction. Currently, the predominant approach in text-to-speech acoustic models uses only the Mel spectrum as the intermediate feature for converting text to speech. However, the resulting Mel spectrograms may be ambiguous in some respects, because the Fourier transform used to compute them has limited ability to capture abrupt signal changes. To improve the clarity of synthesized speech, this study proposes a multi-task learning optimization method and conducts experiments on the Tacotron2 speech synthesis system to demonstrate its effectiveness. The method introduces an additional task based on wavelet spectrograms. The continuous wavelet transform is widely used in applications such as speech enhancement and speech recognition, primarily because it can adaptively vary its time-frequency resolution and performs well at capturing non-stationary signals. Through theoretical and experimental analysis, this study shows that the clarity of Tacotron2 synthesized speech can be improved by introducing the wavelet spectrogram as an auxiliary task: a feature extraction network is added, and wavelet-spectrogram features are extracted from the Mel spectrum output generated by the decoder. Experimental results indicate that the Mean Opinion Score of speech synthesized by the model trained with multi-task learning is 0.17 higher than that of the baseline model. Furthermore, by analyzing the factors that contribute to the success of the continuous wavelet transform-based multi-task learning method in the Tacotron2 model, as well as the effectiveness of multi-task learning, the study conjectures that the proposed method has the potential to enhance the performance of other acoustic models.
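
The two ingredients named in the abstract, a wavelet spectrogram and a multi-task objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the complex Morlet wavelet, the parameter w0, the mean-squared-error losses, and the weight aux_weight are all assumptions made for illustration, and the feature extraction network that maps the decoder's Mel output to wavelet-spectrogram features is not modeled here.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (small admissibility correction omitted)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)

def cwt_scalogram(signal, scales, dt=1.0, w0=6.0):
    """Magnitude of the continuous wavelet transform, one row per scale.

    Hypothetical helper: computes |W(s, b)| by direct convolution, where
    W(s, b) = s**-0.5 * integral of x(t) * conj(psi((t - b) / s)) dt.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Sample the scaled wavelet on a support of +/- 4 scale widths.
        half = int(np.ceil(4 * s / dt))
        t = np.arange(-half, half + 1) * dt
        psi = morlet(t / s, w0) / np.sqrt(s)
        # Correlation with psi equals convolution with its reversed conjugate.
        full = np.convolve(signal, np.conj(psi)[::-1], mode="full") * dt
        # Crop the centre so the output is aligned with the input samples.
        out[i] = np.abs(full[half:half + n])
    return out

def multitask_loss(mel_pred, mel_true, wav_pred, wav_true, aux_weight=0.5):
    """Main Mel loss plus a weighted auxiliary wavelet-spectrogram loss.

    Assumption: both terms are mean squared errors and aux_weight is a
    free hyperparameter; the abstract does not specify either choice.
    """
    mel_loss = np.mean((mel_pred - mel_true) ** 2)
    aux_loss = np.mean((wav_pred - wav_true) ** 2)
    return mel_loss + aux_weight * aux_loss

# Usage: scalogram of a 50 Hz tone sampled at 1 kHz, scales in seconds.
t = np.arange(1000) * 0.001
x = np.sin(2 * np.pi * 50 * t)
scales = np.array([0.005, 0.01, 0.02, 0.04, 0.08])
scalogram = cwt_scalogram(x, scales, dt=0.001)
```

Small scales use short wavelets (fine time resolution) and large scales use long ones (fine frequency resolution), which is the adaptive time-frequency resolution the abstract refers to. In a training setup of the kind described, the auxiliary term would only shape the shared representation; the Mel output would remain the feature passed on for waveform generation.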

Published in EURASIP Journal on Advances in Signal Processing

ISSN: 1687-6172 (Print); 1687-6180 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication; Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: https://asp-eurasipjournals.springeropen.com
