IEEE Access (Jan 2024)
Masked Duration Model for Utterance Duration-Controllable Text-to-Speech
Abstract
Recent neural text-to-speech (TTS) systems achieve high-quality speech synthesis, but controlling utterance duration while maintaining naturalness remains challenging. Existing approaches to utterance duration control rely on post-processing techniques that compromise naturalness and are not effective in all scenarios, particularly when handling variable utterance durations. This study presents a novel masked duration model that enables controllable utterance duration in TTS synthesis. The model uses audio prompts, text prompts, and masks to predict phone durations within the masked span, which corresponds to the utterance duration. Because the target duration is determined first and the phone durations are then predicted within it, the utterance duration can be controlled precisely. This fine-grained control over utterance duration enables more nuanced and realistic speech output. Additionally, an adversarial training strategy is employed to enhance the robustness of the alignment between audio and text prompts. Experimental results demonstrate that the proposed model outperforms the baseline model in utterance duration control, and ablation studies validate the effectiveness of adversarial training in enhancing model performance. The proposed model is suitable for applications requiring precise control over utterance duration, such as automatic voice dubbing.
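The idea of fixing the span length first and then filling it with phone durations can be illustrated with a small sketch. This is not the model described above: it only shows one simple way to turn raw per-phone duration predictions into integer frame counts that sum exactly to a chosen span. The function name `fit_durations_to_span`, the use of frames as the unit, and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import List


def fit_durations_to_span(raw: List[float], span_frames: int) -> List[int]:
    """Turn positive raw phone durations into integer frame counts.

    Hypothetical helper (not from the paper): every phone gets at least
    one frame, and the counts sum exactly to span_frames.
    """
    n = len(raw)
    if n == 0 or any(d <= 0 for d in raw):
        raise ValueError("raw durations must be non-empty and positive")
    if span_frames < n:
        raise ValueError("span is too short to give every phone one frame")

    # Reserve one frame per phone, then share the rest proportionally.
    spare = span_frames - n
    total = sum(raw)
    shares = [d * spare / total for d in raw]
    frames = [1 + int(s) for s in shares]  # int() floors non-negative values

    # Largest-remainder rounding: hand leftover frames to the phones
    # whose proportional share was truncated the most.
    leftover = span_frames - sum(frames)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        frames[i] += 1
    return frames


# Three masked phones with raw predictions 2:1:1, target span of 11 frames.
print(fit_durations_to_span([2.0, 1.0, 1.0], 11))  # -> [5, 3, 3]
```

In this sketch the target span is the only quantity the caller controls, so changing `span_frames` stretches or compresses all phones together while preserving their relative proportions. A learned model could instead redistribute duration unevenly across phones, which a fixed rescaling cannot do.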
Keywords