IEEE Access (Jan 2024)

Applying Syntax-Prosody Mapping Hypothesis and Boundary-Driven Theory to Neural Sequence-to-Sequence Speech Synthesis

  • Kei Furukawa
  • Takeshi Kishiyama
  • Satoshi Nakamura
  • Sakriani Sakti

DOI
https://doi.org/10.1109/ACCESS.2024.3487053
Journal volume & issue
Vol. 12
pp. 160896 – 160917

Abstract

This study presents a novel approach to Japanese speech synthesis that applies two linguistic theories: the syntax-prosody mapping hypothesis and the boundary-driven theory. Focusing on the phonological phenomena of initial lowering and rhythmic boost, we introduce the Recursive Phonological Model, which significantly outperforms traditional methods in both objective and subjective evaluation experiments. We also propose new objective evaluation criteria for Japanese speech synthesis, which offer a more rigorous and linguistically grounded methodology for assessing the quality of synthesized speech. The Recursive Phonological Model accurately captures both the presence and absence of initial lowering, a common phenomenon in Japanese speech. It is the first model to successfully reflect such syntactic variations through intonation, demonstrating its ability to handle complex phonological patterns. The model also reproduces the rhythmic boost phenomenon even though rhythmic boost is absent from the training data, which underscores the importance of learning phonological boundaries in speech synthesis. Our approach not only yields more natural-sounding speech but also incorporates complex linguistic theories into the computational process. This research thus marks a significant advance in the naturalness and linguistic accuracy of speech synthesis, with broader implications for computational linguistics and artificial intelligence.

Keywords