IEEE Access (Jan 2024)

Masked Duration Model for Utterance Duration-Controllable Text-to-Speech

  • Taewoo Kim,
  • Choongsang Cho,
  • Young Han Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3461772
Journal volume & issue
Vol. 12
pp. 136313–136318

Abstract

Recent advances in neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, but maintaining naturalness while controlling utterance duration remains challenging. Existing approaches to utterance duration control rely on post-processing techniques that compromise naturalness, and these techniques are not effective in all scenarios, particularly when handling variable utterance durations. This study presents a novel masked duration model that enables controllable utterance duration in TTS synthesis. The approach uses audio prompts, text prompts, and masks to predict phone durations within the masked span, which corresponds to the utterance duration. This enables precise control: the target utterance duration is specified first, and the phone durations are then predicted to fit it. The model allows fine-grained control over utterance duration, enabling more nuanced and realistic speech outputs. Additionally, an adversarial training strategy is employed to enhance the robustness of the alignment between audio and text prompts. The experimental results demonstrate that the proposed model outperforms the baseline model in utterance duration control, and ablation studies validate the effectiveness of the adversarial training. This technology is suitable for applications requiring precise control over utterance duration, such as automatic voice dubbing.
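The core idea in the abstract is to mask the phone durations over a span, predict them, and have the predictions sum to a user-chosen utterance duration. Below is a minimal, hypothetical PyTorch sketch of that masking-and-rescaling mechanic; all names (MaskedDurationSketch, span_mask, target_frames) are illustrative assumptions, not the paper's architecture, and the closed-form rescaling step is just one simple way to enforce the target total.

```python
# Hypothetical sketch of masked duration prediction (not the paper's model).
# A Transformer encoder sees phone embeddings with the target span replaced
# by a learned mask token, predicts per-phone log-durations for that span,
# and the result is rescaled so the masked phones sum to the target duration.
import torch
import torch.nn as nn

class MaskedDurationSketch(nn.Module):
    def __init__(self, n_phones=100, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 1)  # per-phone log-duration

    def forward(self, phone_ids, span_mask, target_frames):
        # phone_ids: (B, T) int64; span_mask: (B, T) bool, True = masked span
        # target_frames: (B,) desired total duration of the masked span
        x = self.embed(phone_ids)
        x = torch.where(span_mask.unsqueeze(-1), self.mask_token, x)
        h = self.encoder(x)
        dur = torch.exp(self.head(h)).squeeze(-1)   # positive durations (B, T)
        dur = dur * span_mask                       # keep the masked span only
        scale = target_frames / dur.sum(dim=1).clamp(min=1e-6)
        return dur * scale.unsqueeze(1)             # span sums to target_frames

model = MaskedDurationSketch()
ids = torch.randint(0, 100, (1, 12))
mask = torch.zeros(1, 12, dtype=torch.bool)
mask[0, 3:9] = True                                 # mask phones 3..8
durations = model(ids, mask, torch.tensor([200.0])) # want 200 frames in span
print(durations.sum().item())                       # ≈ 200
```

In the paper's actual model, the prediction is conditioned on audio and text prompts and trained adversarially; the sketch drops both to keep the masking-and-rescaling mechanics visible.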

Keywords