Weakly Supervised U-Net with Limited Upsampling for Sound Event Detection

Sangwon Lee; Hyemi Kim; Gil-Jin Jang

doi:10.3390/app13116822

Applied Sciences (Jun 2023)

Affiliations

Sangwon Lee: School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
Hyemi Kim: Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
Gil-Jin Jang: School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea

DOI: https://doi.org/10.3390/app13116822
Journal volume & issue: Vol. 13, no. 11, p. 6822

Abstract

Sound event detection (SED) is the task of identifying sound events in audio recordings together with their onset and offset times. When the training data provide only the event identities and no timing information, SED must be solved by weakly supervised learning. The conventional U-Net with global weighted rank pooling (GWRP) performs reasonably well but is computationally demanding. We propose a novel U-Net with limited upsampling (LUU-Net) and global threshold average pooling (GTAP) to reduce both the model size and the computational overhead. The expansion along the frequency axis in the U-Net decoder was minimized, so that the output map sizes were reduced by 40% at the convolutional layers and 12.5% at the fully connected layers without degrading SED performance. Experimental results on a mixed dataset of DCASE 2018 Tasks 1 and 2 showed that LUU-Net with GTAP was about 23% faster in training and achieved F1 scores of 0.644 in audio tagging and 0.531 in weakly supervised SED, whereas U-Net with GWRP achieved 0.629 and 0.492, respectively. The major contribution of the proposed LUU-Net is the reduction in computation time while SED performance is maintained or improved. The other proposed method, GTAP, further reduced the training time and provides versatility for various audio mixing conditions through adjustment of a single hyperparameter.
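
The two pooling operators named in the abstract can be illustrated with a minimal sketch. The `gwrp` function follows the commonly used definition of global weighted rank pooling (sort the scores in descending order and weight the j-th largest by r**j). The abstract does not define GTAP, so `threshold_average_pool` is only one plausible reading suggested by the name: the function name, the `threshold` parameter, and the fallback to the maximum are assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def gwrp(scores: Sequence[float], r: float = 0.5) -> float:
    """Global weighted rank pooling over a flattened score map.

    Scores are sorted in descending order and the j-th largest value
    (j = 0, 1, ...) receives weight r**j. With r = 1 this reduces to
    average pooling, and with r = 0 to max pooling.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    weights = [r ** j for j in range(len(ranked))]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


def threshold_average_pool(scores: Sequence[float], threshold: float = 0.5) -> float:
    """HYPOTHETICAL threshold-based average pooling (assumed, not from the paper).

    Averages only the scores strictly above `threshold`. If no score
    exceeds the threshold, the maximum score is returned instead; this
    fallback is an illustrative choice.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    selected = [s for s in scores if s > threshold]
    if not selected:
        return max(scores)
    return sum(selected) / len(selected)


if __name__ == "__main__":
    frame_scores = [0.2, 0.8, 0.4]
    print(gwrp(frame_scores, r=0.5))
    print(threshold_average_pool(frame_scores, threshold=0.3))
```

Under this reading, the threshold variant needs no sorting step and is controlled by a single value, which is at least consistent with the abstract's remarks about reduced training time and a single adjustable hyperparameter.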

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
