Contributions of Jitter and Shimmer in the Voice for Fake Audio Detection

Kai Li; Xugang Lu; Masato Akagi; Masashi Unoki

doi:10.1109/ACCESS.2023.3301616

IEEE Access (Jan 2023)

Affiliations

Kai Li: Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Xugang Lu: Advanced Speech Technology Laboratory, National Institute of Information and Communications Technology, Kyoto, Japan
Masato Akagi: Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Masashi Unoki: Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan

DOI: https://doi.org/10.1109/ACCESS.2023.3301616
Journal volume & issue: Vol. 11
pp. 84689 – 84698

Abstract

Fake audio detection (FAD) aims to identify fraudulent speech generated by advanced speech-synthesis techniques. Most current FAD methods rely solely on a deep neural network (DNN) framework that takes either speech waveforms or commonly used acoustic features as input to extract high-level representations, and they overlook the prosodic differences between genuine and fake speech. Prosody carries important cues about the naturalness and emotional content of speech, which can be leveraged to detect fake audio. This paper explicitly investigates the differences in prosody information between genuine and fake speech, as represented by jitter and shimmer features. Our investigation found strong evidence that the levels of jitter and shimmer differ clearly between fake and real speech, particularly for the shimmer feature, which shows a large dynamic variation in fake speech. To ensure accurate estimation of $F_{0}$ for better jitter and shimmer feature representations, we propose using two additional $F_{0}$ estimation methods, YIN and SWIPE, in place of the IRAPT algorithm in the feature extraction process. Moreover, we design a DNN-FAD system that explicitly combines the shimmer and Mel-spectrogram features. The effectiveness of the proposed method for FAD is evaluated on the datasets of the Audio Deep Synthesis Detection (ADD) 2022 and 2023 challenges. The experimental results show that both the static and dynamic continuous shimmer features, especially those extracted with the YIN and SWIPE algorithms, provide knowledge complementary to that of traditional spectrum-based FAD systems. The best configuration reduces the equal error rate from 41.29% to 35.77% in the ADD2023 challenge, a relative improvement of 13.37%.
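As a rough illustration of the quantities named in the abstract, the sketch below computes jitter (cycle-to-cycle variation of the fundamental period) and shimmer (cycle-to-cycle variation of amplitude) using the common "local" definitions, and checks the reported relative improvement in equal error rate. The abstract does not give the paper's exact feature definitions, so the formulas, the function names, and the sample period and amplitude values here are all assumptions made for illustration, not the authors' extraction pipeline.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean value.

    This is the widely used "local" perturbation measure; it is an assumed
    definition, not necessarily the one used in the paper.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods_s):
    """Local jitter from a sequence of glottal cycle periods (seconds)."""
    return local_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Local shimmer from a sequence of per-cycle peak amplitudes."""
    return local_perturbation(peak_amplitudes)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical per-cycle measurements, invented for this example.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.76, 0.82, 0.79]
    print(f"jitter:  {100 * local_jitter(periods):.2f} %")
    print(f"shimmer: {100 * local_shimmer(amplitudes):.2f} %")

    # EER figures taken from the abstract (ADD2023 challenge).
    print(f"relative EER reduction: {100 * relative_reduction(41.29, 35.77):.2f} %")
```

The last line reproduces the 13.37% figure from the abstract: (41.29 - 35.77) / 41.29 ≈ 0.1337. In practice the period sequence would come from an $F_{0}$ estimator such as YIN or SWIPE, which is why the accuracy of $F_{0}$ estimation matters for these features.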

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
