IEEE Access (Jan 2023)
Synthetic Speech Spoofing Detection Based on Online Hard Example Mining
Abstract
The online hard example mining (OHEM) algorithm has been successfully applied to object detection in images. In this paper, we apply OHEM to the training of synthetic speech spoofing detection models in order to address the imbalance between easy and hard samples in the dataset. Our experimental results show that OHEM significantly decreases the equal error rate (EER) of four deep neural network models, namely ResNet18, ResNet50, SE-Res2Net, and Raw-Res2Net, with relative decreases of 42%, 28%, 25%, and 22%, respectively. Raw-Res2Net is a new network architecture proposed in this paper that takes raw audio as input. It performs significantly better than the other three models in identifying some spoofing attack algorithms. Moreover, relative to the two baseline systems of the ASVspoof 2019 competition, Raw-Res2Net reduces the EER by 63% and 68%, respectively. Finally, score-level fusion of multiple OHEM-trained synthetic speech spoofing detection systems shows that the Raw-Res2Net-OHEM model complements the other models well. Without any data augmentation, the final fused system achieves an EER of 0.77% and a minimum tandem detection cost function (min-DCF) of 0.022 on the ASVspoof 2019 LA dataset, outperforming all fused systems previously reported in the literature.
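To make the idea of hard example mining concrete, the following is a minimal sketch of one common OHEM formulation: compute a loss for every sample in a mini-batch, then average only the hardest fraction. The abstract does not state the exact selection rule used in the paper, so the function names, the binary cross-entropy loss, and the `keep_ratio` parameter are all illustrative assumptions rather than the authors' implementation.

```python
import math


def binary_cross_entropy(prob_bonafide, label):
    """Per-sample loss; label is 1 for bona fide speech, 0 for spoofed speech."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_bonafide, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def ohem_loss(probs, labels, keep_ratio=0.5):
    """Average the loss over only the hardest samples of a mini-batch.

    keep_ratio is a hypothetical hyperparameter: the fraction of samples
    (those with the largest loss) that contribute to the training signal.
    Returns the mined loss and the full list of per-sample losses.
    """
    losses = [binary_cross_entropy(p, y) for p, y in zip(probs, labels)]
    num_keep = max(1, int(len(losses) * keep_ratio))
    hardest = sorted(losses, reverse=True)[:num_keep]
    return sum(hardest) / num_keep, losses


# Toy mini-batch: model outputs P(bona fide) and the ground-truth labels.
probs = [0.95, 0.90, 0.40, 0.10, 0.60, 0.05]
labels = [1, 1, 1, 0, 0, 0]

mined, per_sample = ohem_loss(probs, labels, keep_ratio=0.5)
plain = sum(per_sample) / len(per_sample)
print(f"plain mean loss: {plain:.4f}, OHEM loss: {mined:.4f}")
```

In this toy batch, the two misclassified samples dominate the mined loss, while the confidently correct ones are dropped. In a real training loop the same selection would typically be applied to the per-sample losses of a deep learning framework before backpropagation.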
Keywords