Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet

Qing-Dao-Er-Ji Ren; Lele Wang; Wenjing Zhang; Leixiao Li

doi:10.3390/app14020625

Applied Sciences (Jan 2024)


Affiliations

Qing-Dao-Er-Ji Ren: School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
Lele Wang: School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
Wenjing Zhang: School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
Leixiao Li: College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010051, China

DOI: https://doi.org/10.3390/app14020625
Journal volume & issue: Vol. 14, no. 2, p. 625

Abstract

The core challenge of speech synthesis is converting text into audible speech that meets users' needs. In recent years, end-to-end models have significantly improved the quality of synthesized speech. However, because of the characteristics of the Mongolian language and the scarcity of Mongolian audio corpora, work on Mongolian speech synthesis has produced few results, and problems remain in both model performance and synthesis quality. This work addresses them in two steps. First, the phoneme information for Mongolian was refined and a Bang-based pre-training model was constructed to reduce the error rate of synthesized Mongolian words. Second, a Mongolian speech synthesis model based on Ghost and ILPCnet, named the Ghost-ILPCnet model, was proposed. Its acoustic model is an improved version of the Para-WaveNet acoustic model in which ordinary convolution blocks are replaced with stacked Ghost modules, so that Mongolian acoustic features are generated in parallel and speech generation is faster. Its vocoder, the improved ILPCnet, offers high synthesis quality and low complexity compared with other vocoders. Finally, extensive experiments were conducted to verify the effectiveness of the proposed model. The results show that the Ghost-ILPCnet model has a simple structure, fewer parameters, and lower hardware requirements, and that it can be trained in parallel. The mean opinion score (the average subjective rating) of its synthesized speech reached 4.48, and the real-time rate reached 0.0041. The model thus preserves the naturalness and clarity of synthesized speech, speeds up synthesis, and effectively improves the performance of Mongolian speech synthesis.
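
To make two of the abstract's claims concrete, the following Python sketch compares the weight count of an ordinary 1-D convolution with that of a Ghost module, and shows how a real-time rate is typically computed. It is an illustration under stated assumptions, not the authors' implementation: the Ghost module formula follows the general GhostNet design (a primary convolution producing a fraction of the output channels, plus cheap depthwise operations producing the rest), the layer sizes (256 channels, kernel size 3, ratio 2) are hypothetical because the abstract does not give the model's dimensions, and reading "real-time rate" as synthesis time divided by audio duration is also an assumption.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of an ordinary 1-D convolution (bias omitted)."""
    return c_in * c_out * kernel


def ghost1d_params(c_in: int, c_out: int, kernel: int,
                   ratio: int = 2, cheap_kernel: int = 3) -> int:
    """Weight count of a Ghost module with the same input/output shape.

    Assumed structure (general GhostNet design, not taken from the paper):
    a primary convolution produces c_out / ratio "intrinsic" channels, and
    depthwise "cheap" convolutions derive the remaining channels from them.
    """
    if c_out % ratio != 0:
        raise ValueError("c_out must be divisible by ratio")
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * kernel            # ordinary conv, fewer outputs
    cheap = intrinsic * (ratio - 1) * cheap_kernel  # one depthwise filter per ghost channel
    return primary + cheap


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing per second of generated audio (assumed definition)."""
    return synthesis_seconds / audio_seconds


# Hypothetical layer sizes, chosen only for illustration.
ordinary = conv1d_params(256, 256, 3)
ghost = ghost1d_params(256, 256, 3)
print(ordinary, ghost, round(ordinary / ghost, 2))

# Under the assumed definition, a rate of 0.0041 would correspond to
# about 0.041 s of compute for 10 s of audio.
print(real_time_factor(0.041, 10.0))
```

With a ratio of 2, the Ghost module needs roughly half the weights of the ordinary convolution it replaces, which is consistent with the abstract's claim of fewer parameters and faster generation.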

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
