IEEE Access (Jan 2024)

Potential of Speech-Pathological Features for Deepfake Speech Detection

  • Anuwat Chaiwongyen,
  • Suradej Duangpummet,
  • Jessada Karnjana,
  • Waree Kongprawechnon,
  • Masashi Unoki

DOI
https://doi.org/10.1109/ACCESS.2024.3447582
Journal volume & issue
Vol. 12
pp. 121958–121970

Abstract


There is great concern regarding the misuse of deepfake speech technology to synthesize a real person's voice. Developing speech-security systems capable of detecting deepfake speech therefore remains paramount in safeguarding against such misuse. Although various speech features and methods have been proposed, their potential for distinguishing between genuine and deepfake speech remains unclear. Since speech-pathological features combined with deep learning are widely used to assess unnaturalness in disordered voices associated with voice-production mechanisms, we investigated the potential of eleven speech-pathological features for distinguishing between genuine and deepfake speech: jitter (three types), shimmer (four types), harmonics-to-noise ratio, cepstral harmonics-to-noise ratio, normalized noise energy, and glottal-to-noise excitation ratio. This paper proposes a method that combines two models built on two different dimensionalities of speech-pathological features, together with mel-spectrogram features, to improve both the effectiveness and the efficiency of deepfake speech detection. We evaluated the proposed method on the datasets of the Automatic Speaker Verification Spoofing and Countermeasures Challenges (ASVspoof 2019 and 2021). The results indicate that the proposed method outperforms the baselines in terms of accuracy, recall, F1-score, and F2-score, achieving 95.06%, 99.46%, 97.30%, and 98.59%, respectively, on the ASVspoof 2019 dataset. It also surpasses the baselines on the ASVspoof 2021 dataset in terms of recall, F1-score, F2-score, and equal error rate, achieving 99.96%, 96.65%, 98.18%, and 15.97%, respectively.
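To make the abstract's terminology concrete, the sketch below shows textbook definitions of local jitter and local shimmer (mean absolute difference between consecutive pitch periods or peak amplitudes, normalized by the mean), and the F-beta score with beta = 2, which is the F2-score reported above. This is an illustrative sketch only: the function names are hypothetical, and the paper's actual feature-extraction pipeline and the other jitter/shimmer variants it uses are not detailed here.

```python
def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period (textbook definition)."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))


def shimmer_local(amplitudes):
    """Local shimmer: same form as local jitter, but computed over
    consecutive cycle peak amplitudes instead of periods."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 gives the F2-score, which weights
    recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a perfectly periodic voice, `jitter_local` is 0; deepfake or disordered voices tend to show atypical cycle-to-cycle variation, which is what these features capture. The F2-score's emphasis on recall matches a detection setting where missing a deepfake is costlier than a false alarm.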

Keywords