Pyramid Feature Attention Network for Speech Resampling Detection

Xinyu Zhou; Yujin Zhang; Yongqi Wang; Jin Tian; Shaolun Xu

doi:10.3390/app14114803

Applied Sciences (Jun 2024)

Affiliations

Xinyu Zhou: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Yujin Zhang: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Yongqi Wang: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Jin Tian: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Shaolun Xu: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

DOI: https://doi.org/10.3390/app14114803
Journal volume & issue: Vol. 14, no. 11, p. 4803

Abstract

Speech forgery and tampering, increasingly facilitated by advanced audio editing software, pose significant threats to the integrity and privacy of digital speech avatars. Speech resampling is a post-processing operation common to many speech-tampering methods, so its forensic detection is of great significance. Most previous work on speech resampling detection used traditional feature extraction and classification methods to distinguish original speech from forged speech. To exploit the powerful feature-extraction ability of deep learning, this paper converts the speech signal into a spectrogram with time-frequency characteristics and uses a feature pyramid network (FPN) with the Squeeze-and-Excitation (SE) attention mechanism to learn speech resampling features. The proposed method combines low-level location information with high-level semantic information, which dramatically improves the detection performance for speech resampling. Experiments were carried out on a resampling corpus built from the TIMIT dataset. The results indicate that the proposed method significantly improves detection accuracy for speech resampled with various factors. For tampered speech with a resampling factor of 0.9, the detection accuracy increases by nearly 20%. In addition, the robustness test demonstrates that the proposed model is strongly resistant to MP3 compression and that its overall performance is better than that of existing methods.
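
The abstract names Squeeze-and-Excitation (SE) attention without describing it. The sketch below shows the generic SE operation on a single feature map, using NumPy only: global average pooling per channel (squeeze), a two-layer bottleneck with ReLU and sigmoid (excitation), and channel-wise rescaling. It is an illustration of the general technique, not the authors' network. The function name se_block, the tensor sizes, the reduction ratio r, and the random weights are all hypothetical, and bias terms are omitted for brevity.

```python
import numpy as np


def se_block(features, w_reduce, w_expand):
    """Apply generic SE channel attention to a (C, H, W) feature map.

    w_reduce has shape (C // r, C) and w_expand has shape (C, C // r);
    both are placeholders for weights that would normally be learned.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck with ReLU, then sigmoid, gives weights in (0, 1).
    hidden = np.maximum(w_reduce @ squeezed, 0.0)          # shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))   # shape (C,)
    # Scale: broadcast the per-channel weights over height and width.
    return features * weights[:, None, None], weights


# Hypothetical sizes chosen for illustration; they are not from the paper.
rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
features = rng.standard_normal((C, H, W))
w_reduce = rng.standard_normal((C // r, C)) * 0.1
w_expand = rng.standard_normal((C, C // r)) * 0.1

scaled, weights = se_block(features, w_reduce, w_expand)
print(scaled.shape, weights.shape)
```

In a feature pyramid, a block of this kind could be applied to the feature map at each pyramid level so that informative channels are emphasized before levels are merged; where exactly the paper places it is not stated in the abstract.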

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
