Applied Sciences (Apr 2024)

A Preliminary Study of Model-Generated Speech

  • Man-Ni Chu,
  • Yu-Chun Wang

DOI
https://doi.org/10.3390/app14073104
Journal volume & issue
Vol. 14, no. 7
p. 3104

Abstract

This study compared model-generated sounds with the process of sound acquisition in humans. It drew on two dictionaries of the Chaoshan dialect that together span approximately one century. Identical Chinese characters were selected from each dictionary, and their contemporary pronunciations were documented. After inconsistencies in pronunciation were manually corrected, three machine learning methods were trained to map the pronunciations in one dictionary to those in the other: an attention-based sequence-to-sequence model, DirecTL+, and Sequitur. Model accuracy was evaluated using five-fold cross-validation, with a maximum accuracy of 68%. The study also investigated how the probability of the unit that follows a sound influences the accuracy of these methods. The attention-based sequence-to-sequence model is influenced not only by input frequency but also by the probability of the subsequent unit.

Keywords