IEEE Access (Jan 2021)
Semi-Supervised NMF-CNN for Sound Event Detection
Abstract
The lack of strongly labeled data can limit the potential of a Sound Event Detection (SED) system trained using deep learning approaches. To address this issue, this paper proposes a novel method that approximates strong labels for weakly labeled data using Nonnegative Matrix Factorization (NMF) in a supervised manner. Using a combined transfer learning and semi-supervised learning framework, two different Convolutional Neural Networks (CNNs) are trained on synthetic data, approximated strongly labeled data, and unlabeled data: one model produces the audio tags, while the other produces the frame-level predictions. The proposed methodology is evaluated on three subsets of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 dataset: the validation dataset, the challenge evaluation dataset, and the public YouTube evaluation dataset. Our proposed methodology outperforms the baseline system by a minimum of 7% across these three subsets. It also outperforms the top 3 submissions from DCASE 2019 challenge task 4 on the validation and public YouTube evaluation datasets, and is competitive with the top submission in DCASE 2020 challenge task 4 on the challenge evaluation data. A post-challenge analysis on the validation dataset revealed the causes of the performance difference between our system and the top submission of DCASE 2020 challenge task 4. The leading causes that we observed are 1) the detection threshold tuning method and 2) the augmentation techniques used. By changing our detection threshold tuning method, our system could outperform the first-place submission by 1.5%. The post-challenge analysis also revealed that our system is more robust than the top DCASE 2020 challenge task 4 submission on long-duration audio clips, where we outperformed it by 37%.
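To illustrate the idea of approximating strong labels with supervised NMF, the following is a minimal sketch and not the paper's exact procedure. It assumes a nonnegative dictionary `W` has already been learned for the event class named in a clip's weak label, holds `W` fixed, solves for the activations `H` of the clip's magnitude spectrogram `V`, and marks frames whose summed activation exceeds a relative threshold. The function names, the Euclidean multiplicative update, and the `rel_threshold` parameter are all assumptions made for illustration.

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-10, seed=0):
    """Estimate H >= 0 such that V ~= W @ H, with the dictionary W held fixed.

    Uses the standard multiplicative update for the Euclidean cost.
    V: (n_freq, n_frames) nonnegative spectrogram.
    W: (n_freq, n_components) nonnegative dictionary (assumed pre-trained).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H

def approximate_strong_labels(V, W, rel_threshold=0.5):
    """Return a boolean frame-level mask for the weakly labeled class.

    A frame is marked active when its summed activation is at least
    rel_threshold times the clip's peak summed activation (hypothetical rule).
    """
    H = nmf_activations(V, W)
    energy = H.sum(axis=0)
    peak = energy.max()
    if peak <= 0:
        # No activation anywhere: nothing can be marked active.
        return np.zeros(V.shape[1], dtype=bool)
    return energy >= rel_threshold * peak
```

In practice the resulting mask would be converted to onset and offset times using the spectrogram hop size, and the threshold would need tuning per event class.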
Keywords