IEEE Access (Jan 2023)
Contributions of Jitter and Shimmer in the Voice for Fake Audio Detection
Abstract
Fake audio detection (FAD) aims to identify fraudulent speech generated through advanced speech-synthesis techniques. Most current FAD methods rely solely on a deep neural network (DNN) framework that extracts high-level representations from either raw speech waveforms or commonly used acoustic features, overlooking the prosody differences between genuine and fake speech. Prosody carries important cues about the naturalness and emotional content of speech, which can be leveraged for detecting fake audio. This paper explicitly investigates the prosody differences between genuine and fake speech as represented by jitter and shimmer features. Our investigation provides strong evidence of clear differences in the levels of jitter and shimmer between fake and genuine speech, particularly for the shimmer feature, which exhibits large dynamic variation in fake speech. To ensure accurate estimation of $F_{0}$ and thus better jitter and shimmer representations, we propose using two additional $F_{0}$ estimation methods, YIN and SWIPE, in place of the IRAPT algorithm in the feature extraction process. Moreover, we design a DNN-based FAD system that explicitly combines shimmer and Mel-spectrogram features. The effectiveness of the proposed method is evaluated on the datasets of the Audio Deep Synthesis Detection (ADD) 2022 and 2023 challenges. The experimental results show that both the static and dynamic continuous shimmer features, especially those extracted with the YIN and SWIPE algorithms, provide complementary information to traditional spectrum-based FAD systems. The best configuration reduces the equal error rate (EER) from 41.29% to 35.77% on the ADD 2023 challenge, a relative improvement of 13.37%.
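To make the jitter and shimmer measures concrete, the sketch below estimates both from a YIN $F_{0}$ contour. It is a minimal illustration, not the authors' pipeline: the function name, parameter defaults, and the use of frame-level RMS as a stand-in for per-pitch-period peak amplitude are assumptions of this sketch (classical shimmer is computed on period-by-period amplitudes).

```python
import numpy as np
import librosa

def jitter_shimmer(wav_path, fmin=65.0, fmax=400.0,
                   frame_length=1024, hop_length=256):
    """Rough frame-level jitter/shimmer estimates from an F0 contour."""
    y, sr = librosa.load(wav_path, sr=16000)

    # F0 contour via YIN, one of the two estimators the paper substitutes
    # for IRAPT.
    f0 = librosa.yin(y, fmin=fmin, fmax=fmax, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)

    # Frame RMS as an amplitude proxy (an approximation of this sketch;
    # exact shimmer uses per-pitch-period peak amplitudes).
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]

    n = min(len(f0), len(rms))
    # Heuristic voicing mask: keep frames whose F0 lies strictly inside the
    # search range, since YIN itself makes no explicit voicing decision.
    voiced = (f0[:n] > fmin) & (f0[:n] < fmax)
    if voiced.sum() < 3:
        return float("nan"), float("nan")

    periods = 1.0 / f0[:n][voiced]
    amps = rms[:n][voiced]

    # Local jitter: mean absolute difference between consecutive pitch
    # periods, normalised by the mean period.
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    # Local shimmer: the analogous statistic on the amplitude sequence.
    shimmer = np.mean(np.abs(np.diff(amps))) / np.mean(amps)
    return jitter, shimmer
```

A SWIPE-based contour could be swapped in via a library that implements it (e.g., pysptk's swipe function) while leaving the downstream difference statistics unchanged, which mirrors how the paper varies only the $F_{0}$ estimator within a fixed feature-extraction process.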
Keywords