IEEE Access (Jan 2024)

Hybrid Transformer Architectures With Diverse Audio Features for Deepfake Speech Classification

  • Khalid Zaman,
  • Islam J. A. M. Samiul,
  • Melike Sah,
  • Cem Direkoglu,
  • Shogo Okada,
  • Masashi Unoki

DOI
https://doi.org/10.1109/ACCESS.2024.3478731
Journal volume & issue
Vol. 12
pp. 149221–149237

Abstract

The rise of synthetic speech technologies has triggered growing concerns about the increasing difficulty of distinguishing between real and fake voices. In this context, we propose novel hybrid transformer-based models combined with different audio feature analysis techniques and achieve state-of-the-art results. To the best of our knowledge, none of the existing methods have considered combining various hybrid transformer models with different audio features for fake speech classification, which forms the main novelty of our work. In our work, transformer models are compared with hybrid transformer architectures including Convolutional Neural Network (CNN)-Transformer (i.e., ResNet34-Transformer and VGG16-Transformer models), Bi-directional Long Short-Term Memory (Bi-LSTM)-Transformer, and Transformer with Support Vector Machine (SVM), using different audio feature extraction techniques. In our approach, we utilize three audio feature extraction techniques as input representations: Mel spectrogram (Mel), Mel-Frequency Cepstral Coefficients (MFCC), and Short-Time Fourier Transform (STFT). Evaluating the hybrid transformer models on real and fake speech from the ASVspoof LA dataset across these audio features shows that the STFT feature performs best with the ResNet34-Transformer model, achieving state-of-the-art performance with a development set equal error rate (EER) of 0.0% and an evaluation set EER of 3.22%, surpassing all other methods. In terms of accuracy, the STFT feature also performs best with the VGG16-Transformer model, achieving a development set accuracy of 99.55% and an evaluation set accuracy of 94.04%. These results indicate that the proposed approach outperforms the baseline and state-of-the-art methods.
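
The abstract names three input representations (Mel spectrogram, MFCC, and STFT). The following sketch shows one way these could be computed with librosa; the toolkit, sample rate, FFT size, hop length, and filter counts are illustrative assumptions, not the parameters reported in the paper.

    import librosa
    import numpy as np

    def extract_features(path, sr=16000, n_fft=512, hop_length=160):
        """Compute Mel spectrogram, MFCC, and STFT magnitude features
        (parameter values are illustrative assumptions)."""
        y, sr = librosa.load(path, sr=sr)

        # Mel spectrogram (power), converted to decibels for a CNN-friendly dynamic range
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        # MFCCs derived from the Mel spectrogram
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                    n_fft=n_fft, hop_length=hop_length)

        # Short-Time Fourier Transform magnitude, also in decibels
        stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        stft_db = librosa.amplitude_to_db(stft, ref=np.max)

        return mel_db, mfcc, stft_db

The best-performing model is described as a ResNet34-Transformer hybrid, i.e., a CNN backbone feeding a transformer encoder. A minimal PyTorch sketch of such a wiring is given below; the layer sizes, pooling strategy, and classification head are assumptions about how a CNN-Transformer hybrid might be assembled, not the authors' exact architecture.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    class ResNet34Transformer(nn.Module):
        """Hypothetical CNN-Transformer hybrid: ResNet34 feature maps are
        flattened into a token sequence and passed to a transformer encoder."""
        def __init__(self, d_model=512, nhead=8, num_layers=2, num_classes=2):
            super().__init__()
            backbone = resnet34(weights=None)
            # Adapt the first convolution to single-channel spectrogram input
            backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                       padding=3, bias=False)
            # Keep the convolutional stages; drop average pooling and the FC head
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, x):                 # x: (batch, 1, freq, time)
            f = self.cnn(x)                   # (batch, 512, H', W')
            f = f.flatten(2).transpose(1, 2)  # (batch, H'*W', 512) token sequence
            h = self.encoder(f)               # self-attention over time-frequency patches
            return self.classifier(h.mean(dim=1))  # pooled tokens -> bonafide/spoof logits

    # Example usage with a batch of STFT magnitudes treated as 1-channel images
    model = ResNet34Transformer()
    logits = model(torch.randn(4, 1, 257, 400))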

Keywords