Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai

doi:10.3390/app112110475

Applied Sciences (Nov 2021)

Affiliations

Xiao Zhou: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China
Zhenhua Ling: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China
Yajun Hu: iFLYTEK Research, Hefei 230088, China
Lirong Dai: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China

Journal volume & issue: Vol. 11, no. 21, p. 10475

Abstract

An encoder–decoder with attention has become a popular approach to sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods that exploit the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. This paper therefore proposes inserting hidden states into phone sequences to handle cases where pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. The hidden states absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA achieves better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA match the manually labeled phrase boundaries quite well.
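
The three options described in the abstract can be illustrated with a toy sketch in Python. It is not the paper's formulation: it uses hard alignment paths rather than soft attention weights, and the state layout is an assumption made here for illustration (one hypothetical hidden state between each pair of adjacent phones, with even indices for phones and odd indices for hidden states). The function names are likewise hypothetical.

```python
def allowed_moves(state, num_phones):
    """Return the states reachable in one decoding frame from `state`.

    Assumed layout (illustrative only): state 2*k is phone k, and
    state 2*k + 1 is a hidden state placed between phone k and phone k + 1.
    """
    last_phone = 2 * (num_phones - 1)
    moves = [state]                  # staying in place is always allowed
    if state % 2 == 0:               # current state is a phone
        if state < last_phone:
            moves.append(state + 1)  # jump to the hidden state after it
            moves.append(state + 2)  # move forward to the next phone
    else:                            # current state is a hidden state
        moves.append(state + 1)      # leave it for the next phone
    return moves


def is_valid_path(path, num_phones):
    """Check that every frame-to-frame step in `path` is an allowed move."""
    return all(
        nxt in allowed_moves(cur, num_phones)
        for cur, nxt in zip(path, path[1:])
    )


def pause_positions(path):
    """Phone indices after which the path visited a hidden state."""
    return sorted({state // 2 for state in path if state % 2 == 1})


# Three phones; the path pauses between phone 1 and phone 2.
path = [0, 0, 2, 3, 3, 4]
print(is_valid_path(path, num_phones=3))
print(pause_positions(path))
```

Under this assumed layout, reading pause positions off an alignment path amounts to listing the hidden states the path passes through, which mirrors how the abstract describes deriving pause positions from the SSMA alignment paths.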

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
