IEEE Access (Jan 2024)
Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
Abstract
Deep neural network based text-to-speech (TTS) technology has advanced speech synthesis to a quality approaching that of human speech. Zero-shot voice cloning TTS is a system that accepts text and a few seconds of a sample of the target speaker's voice as input and produces speech waveforms similar to the target speaker's voice. Most recent zero-shot voice cloning TTS studies still focus on normal human voices; the technology remains limited for individuals with speech disorders such as dysphonia. We observe that our baseline zero-shot TTS model, when applied to the dysphonia domain, still performs poorly on the following aspects: speaker similarity, speech intelligibility (clarity), and speech sound quality. This research develops 24 zero-shot voice cloning TTS models to determine which models can improve the baseline model's performance on the dysphonia domain. We propose four categories of changes to the baseline model architecture and settings: input-level text sequences (grapheme, phoneme, or a grapheme-phoneme combination), speaker embedding type (speaker encoder or speaker model), speaker embedding position (at the TTS encoder only, or at both the TTS encoder and decoder), and loss function (with or without speaker consistency loss). The experimental results show that the best model uses the following configuration: grapheme-phoneme-level text sequences, a speaker model as the speaker embedding, the speaker embedding placed at the TTS encoder only, and speaker consistency loss added to the frame-level speech loss. Compared to the baseline model, our proposed best model improves speaker cosine similarity (COS), speech intelligibility (CER), and speech sound quality (MOS) in the domain of dysphonia speech disorders by 0.197, 0.55%, and 0.244, respectively.
When compared with the original voices of dysphonia disorder speakers, the best model also increases speech intelligibility and speech sound quality by 13.45% and 0.22, respectively.
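The COS metric reported above is, by convention, the cosine similarity between speaker embeddings extracted from the synthesized and reference utterances. A minimal sketch of that computation (the function name and use of NumPy are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def speaker_cosine_similarity(emb_synth, emb_ref):
    """Cosine similarity between two fixed-length speaker embedding vectors.

    Returns a value in [-1, 1]; higher means the synthesized voice is
    closer to the reference speaker in embedding space.
    """
    a = np.asarray(emb_synth, dtype=float)
    b = np.asarray(emb_ref, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical embeddings yield 1.0; orthogonal embeddings yield 0.0.
```

In practice the embeddings would come from the speaker encoder or speaker model described in the paper; any real evaluation pipeline would average this score over many utterance pairs.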
Keywords