IEEE Access (Jan 2024)
Hybrid Transformer Architectures With Diverse Audio Features for Deepfake Speech Classification
Abstract
The rise of synthetic speech technologies has triggered growing concern about the increasing difficulty of distinguishing real voices from fake ones. In this context, we propose novel hybrid transformer-based models combined with different audio feature analysis techniques and achieve state-of-the-art results. To the best of our knowledge, no existing method has considered combining various hybrid transformer models with different audio features for fake speech classification, which forms the main novelty of our work. We compare transformer models with hybrid transformer architectures, including Convolutional Neural Network (CNN)-Transformer (i.e., ResNet34-Transformer and VGG16-Transformer), Bi-directional Long Short-Term Memory (Bi-LSTM)-Transformer, and Transformer with Support Vector Machine (SVM), using different audio feature extraction techniques. Our approach employs three audio feature extraction techniques as input representations: Mel spectrogram (Mel), Mel Frequency Cepstral Coefficients (MFCC), and Short-Time Fourier Transform (STFT). Evaluation on real and fake speech from the ASVspoof LA dataset with the hybrid transformer models across these audio features shows that the STFT feature performs best with the ResNet34-Transformer model, achieving state-of-the-art performance with a development set equal error rate (EER) of 0.0% and an evaluation set EER of 3.22%, surpassing all other methods. In terms of accuracy, the STFT feature also performs best with the VGG16-Transformer model, achieving a development set accuracy of 99.55% and an evaluation set accuracy of 94.04%. These results indicate that the proposed approach outperforms the baseline and state-of-the-art methods.
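To illustrate the three input representations named in the abstract, the following is a minimal sketch assuming the librosa library and a 16 kHz mono waveform; the specific parameter values (n_mels, n_mfcc, n_fft, hop_length) are hypothetical choices for illustration and are not the settings reported in the paper.

    import librosa
    import numpy as np

    def extract_features(path, sr=16000):
        """Compute the three input representations discussed in the abstract:
        Mel spectrogram (Mel), MFCC, and STFT magnitude spectrogram.
        Parameter values here are illustrative, not the paper's settings."""
        # Load audio as a mono waveform at the target sampling rate
        y, sr = librosa.load(path, sr=sr, mono=True)

        # Mel spectrogram, converted to a log (dB) scale
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        # Mel Frequency Cepstral Coefficients
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

        # Short-Time Fourier Transform magnitude, converted to a dB scale
        stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))
        stft_db = librosa.amplitude_to_db(stft, ref=np.max)

        # Each output is a 2-D (frequency/coefficient x time) array that can be
        # fed to a CNN-Transformer front end as an image-like input
        return mel_db, mfcc, stft_db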
Keywords