Applied Artificial Intelligence (Dec 2023)

Sound Event Detection System Based on VGGSKCCT Model Architecture with Knowledge Distillation

  • Sung-Jen Huang,
  • Chia-Chuan Liu,
  • Chia-Ping Chen

DOI
https://doi.org/10.1080/08839514.2022.2152948
Journal volume & issue
Vol. 37, no. 1

Abstract


Sound event detection involves detecting acoustic events of multiple classes in audio recordings, along with their times of occurrence. Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge, sound event detection in domestic environments, is a contest on this task. In this paper, we engineer sound event detection systems using the data and the performance metrics defined in this contest. Notably, the polyphonic sound detection scores (PSDS) in two scenarios were recently adopted as practical and effective metrics. Our development started from a basic system built with reference to systems from previous years' contests, resulting in a system similar to that of the winning team in the DCASE 2021 Challenge. A clip-level consistency branch is then added to the model architecture to improve the PSDS in scenario 2, which focuses on distinguishing different event classes. In addition, we use knowledge distillation with the mean teacher model to improve system performance; in this way, the model can learn from a pre-trained model without being fully restricted by its performance. Finally, we further enhance system robustness through consistency criteria in a second stage of training. On the official validation set of the Domestic Environment Sound Event Detection (DESED) dataset, our final system achieves PSDS of 0.418 and 0.661 in the two scenarios, significantly outperforming the DCASE 2021 baseline system, which scores 0.341 and 0.546.
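The mean teacher scheme mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the parameter names, the smoothing rate `alpha`, and the use of a mean-squared-error consistency criterion are illustrative assumptions. The idea is that the teacher's weights track an exponential moving average (EMA) of the student's weights, and a consistency loss pulls the student's predictions toward the teacher's.

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.999):
    """Mean-teacher update: each teacher weight becomes an exponential
    moving average of the corresponding student weight.
    alpha is a hypothetical smoothing rate, not a value from the paper."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

def consistency_loss(student_probs, teacher_probs):
    """Mean-squared error between student and teacher class probabilities,
    one common choice of consistency criterion."""
    return float(np.mean((np.asarray(student_probs)
                          - np.asarray(teacher_probs)) ** 2))

# Toy usage: one scalar "weight" per model, one training step.
teacher = [np.array([1.0])]
student = [np.array([0.0])]
teacher = ema_update(teacher, student, alpha=0.9)  # teacher drifts toward student
loss = consistency_loss([0.2, 0.8], [0.3, 0.7])   # penalizes disagreement
```

Because the teacher is an average of past student states rather than a fixed pre-trained model, the student is guided by it without being fully restricted by its performance, which matches the motivation stated in the abstract.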