DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection

Rabbia Mahum; Aun Irtaza; Ali Javed; Haitham A. Mahmoud; Haseeb Hassan

doi:10.1186/s13636-024-00335-9

EURASIP Journal on Audio, Speech, and Music Processing (Apr 2024)

Affiliations

Rabbia Mahum: Computer Science Department, UET Taxila
Aun Irtaza: Computer Science Department, UET Taxila
Ali Javed: Software Engineering Department, UET Taxila
Haitham A. Mahmoud: Industrial Engineering Department, College of Engineering, King Saud University
Haseeb Hassan: College of Big Data and Internet, Shenzhen Technology University (SZTU)

DOI: https://doi.org/10.1186/s13636-024-00335-9
Journal volume & issue: Vol. 2024, no. 1
pp. 1 – 16

Abstract

Spoofed speech is becoming a serious threat to society due to advances in artificial intelligence techniques. There is therefore a need for an automated spoofing detector that can be integrated into automatic speaker verification (ASV) systems. In this study, we propose a novel and robust model, named DeepDet, based on a deep-layered architecture, to classify speech into two classes: spoofed and bonafide. DeepDet is an improved model based on Yet Another Mobile Network (YAMNet), employing a customized MobileNet combined with a bottleneck attention module (BAM). First, we convert audio into mel-spectrograms, which are time–frequency representations on the mel scale. Second, we train our deep-layered model using the extracted mel-spectrograms on the Logical Access (LA) set of the ASVspoof 2019 dataset, which includes synthesized speech and voice conversion. Finally, we classify the audio using our trained binary classifier. More precisely, we exploit the layered architecture and guided attention to distinguish spoofed speech from bonafide samples. Our improved model employs depthwise separable convolutions, which make it lighter than existing techniques. Furthermore, we conducted extensive experiments to assess the performance of the proposed model using the ASVspoof 2019 corpus. We attained an equal error rate (EER) of 0.042% on Logical Access (LA) attacks and 0.43% on Physical Access (PA) attacks. The proposed model thus performs strongly on the ASVspoof 2019 dataset, indicating the effectiveness of DeepDet over existing spoofing detectors. Additionally, our proposed model is robust enough to identify unseen spoofed audio and classifies the various attacks accurately.
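For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an equal error rate can be approximated from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the assumption that a higher score means "more likely bonafide", and the simple threshold sweep (without interpolation) are all illustrative choices.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumption (not from the paper): higher score = more likely bonafide.
    A sample is accepted as bonafide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection rate: bonafide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores, invented purely for illustration.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(bonafide, spoofed))
```

With the toy scores above, one bonafide sample and one spoofed sample overlap, so the two error rates meet at 25%. An EER of 0.042% therefore corresponds to an operating point where roughly 4 in 10,000 samples of each class are misclassified.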

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
