Dianxin kexue (Nov 2024)

A method of synthetic spoofing speech detection using self-supervised contrastive learning

  • YANG Man,
  • JIAN Zhihua,
  • LIANG Chenghan

Journal volume & issue
Vol. 40
pp. 40 – 49

Abstract

To remove the effect that the imbalance between bonafide and fake speech samples in the training dataset has on synthetic speech detection, and to further improve detection accuracy, a synthetic speech detection method based on self-supervised contrastive learning is proposed. Pitch-transformed samples are treated as negative samples, and the neural network is trained to make the anchor sample features differ from the negative sample features, so that the network learns to extract features sensitive to pitch transformation. A deep residual network serves as the back-end classifier that judges whether the speech is genuine. Experimental results show that the proposed method significantly reduces the equal error rate compared with systems based on traditional hand-crafted acoustic features, deep learning-based systems, and end-to-end spoofing speech detection systems. The method trains the network to extract pitch-sensitive features, its accuracy is not affected by the imbalance between bonafide and fake speech in the dataset, and the accuracy of synthetic spoofing speech detection is significantly improved.
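
The training idea in the abstract, pushing anchor features away from the features of a pitch-transformed negative, can be illustrated with a small sketch. The abstract does not state the loss function, so everything below is an assumption made for illustration: the InfoNCE-style form of the loss, the existence of a positive view of the anchor, the temperature value, and the names cosine and contrastive_loss. Feature vectors are plain Python lists standing in for network outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negative, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive and one negative.

    Assumption (not from the paper): `positive` is another view of the
    anchor utterance, and `negative` is the feature vector of the
    pitch-transformed copy. The loss is small when the anchor is close
    to the positive and far from the negative.
    """
    s_pos = cosine(anchor, positive) / temperature
    s_neg = cosine(anchor, negative) / temperature
    # Numerically stable log(exp(s_pos) + exp(s_neg)).
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


# Hypothetical 2-D features: the negative is orthogonal to the anchor in
# the first case and identical to it in the second.
well_separated = contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
collapsed = contrastive_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(well_separated < collapsed)
```

Minimizing a loss of this kind rewards the encoder for producing different features for a sample and its pitch-transformed copy, which is one way to obtain the pitch-sensitive features that the abstract describes. Because the negatives are generated from the samples themselves, such an objective does not depend on the ratio of bonafide to fake labels.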

Keywords