IEEE Access (Jan 2023)
EDL-Det: A Robust TTS Synthesis Detector Using VGG19-Based YAMNet and Ensemble Learning Block
Abstract
Various audio deep fake synthesis algorithms exist, such as deep voice, tacotron, fastspeech, and imitation techniques. Despite the existence of various spoofing speech detectors, they are not ready to distinguish unseen audio samples with high precision. In this study, we suggest a robust model, namely an Ensemble Deep Learning Detector (EDL-Det), to detect text-to-speech (TTS) and categorize it into spoofed and bonafide classes. Our proposed model is an improved method based on Yet Another Multi-scale Convolutional Neural Network (YAMNet) employing VGG19 as a base network combined with two other deep learning(DL) techniques. Our proposed system effectively analyzes the audio to extract better artifacts. We have added an ensemble learning block that consists of ResNet50 and InceptionNetv2. First, we convert speech into mel-spectrograms that consist of time-frequency representations. Second, we train our model using the ASVspoof-2019 dataset. Ultimately, we classified the audios, transforming them into mel-spectrograms using our trained binary classifier and a majority voting scheme by three networks. Due to ensemble architecture, our proposed model effectively extracts the most representative features from the mel-spectrograms. Furthermore, we have performed extensive experiments to assess the performance of the suggested model using the ASVspoof 2019 corpus. Additionally, our proposed model is robust enough to identify the unseen spoofed audios and accurately classify the attacks based on cloning algorithms.
Keywords