You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection

Satvik Venkatesh; David Moffat; Eduardo Reck Miranda

doi:10.3390/app12073293

Applied Sciences (Mar 2022)

Affiliations

Satvik Venkatesh: Interdisciplinary Centre for Computer Music Research, University of Plymouth, Plymouth PL4 8AA, UK
David Moffat: Interdisciplinary Centre for Computer Music Research, University of Plymouth, Plymouth PL4 8AA, UK
Eduardo Reck Miranda: Interdisciplinary Centre for Computer Music Research, University of Plymouth, Plymouth PL4 8AA, UK

Journal volume & issue: Vol. 12, no. 7, p. 3293

Abstract

Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio-content analysis, speech recognition, audio indexing, and music information retrieval. In recent years, most research articles have adopted segmentation-by-classification, a technique that divides audio into small frames and classifies each frame individually. In this paper, we present a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in computer vision. Instead of performing frame-based classification, we convert the detection of acoustic boundaries into a regression problem. This is done by having separate output neurons to detect the presence of an audio class and to predict its start and end points. The relative improvement in F-measure of YOHO, compared to the state-of-the-art Convolutional Recurrent Neural Network, ranged from 1% to 6% across multiple datasets for audio segmentation and sound event detection. As the output of YOHO is more end-to-end and has fewer neurons to predict, inference is at least 6 times faster than with segmentation-by-classification. In addition, as this approach predicts acoustic boundaries directly, post-processing and smoothing are about 7 times faster.
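
To illustrate the regression-style output described in the abstract, the following is a minimal sketch of how such predictions might be decoded into events. It is not the authors' implementation. The output layout (one `(presence, start, end)` triple per class per time bin, with start and end expressed as fractions of the bin), the 0.5 presence threshold, and the names `decode_events` and `merge_events` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# An event is (class name, start time in seconds, end time in seconds).
Event = Tuple[str, float, float]
# Hypothetical per-class output: (presence, relative start, relative end).
Triple = Tuple[float, float, float]


def decode_events(
    pred: Sequence[Sequence[Triple]],
    class_names: Sequence[str],
    bin_duration: float,
    threshold: float = 0.5,  # assumed presence threshold
) -> List[Event]:
    """Turn per-bin (presence, start, end) triples into absolute-time events."""
    events: List[Event] = []
    for bin_idx, bin_pred in enumerate(pred):
        bin_start = bin_idx * bin_duration
        for name, (presence, start_rel, end_rel) in zip(class_names, bin_pred):
            if presence < threshold:
                continue  # class judged absent in this bin
            start = bin_start + start_rel * bin_duration
            end = bin_start + end_rel * bin_duration
            if end > start:  # discard empty or inverted intervals
                events.append((name, start, end))
    return events


def merge_events(events: Sequence[Event], max_gap: float = 1e-6) -> List[Event]:
    """Join events of the same class that touch across a bin boundary."""
    merged: List[Event] = []
    for name, start, end in sorted(events):
        if merged and merged[-1][0] == name and start - merged[-1][2] <= max_gap:
            prev = merged[-1]
            merged[-1] = (name, prev[1], max(prev[2], end))
        else:
            merged.append((name, start, end))
    return merged


# Made-up predictions for three 1-second bins and two classes.
class_names = ["speech", "music"]
pred = [
    [(0.9, 0.5, 1.0), (0.1, 0.0, 0.0)],
    [(0.8, 0.0, 1.0), (0.7, 0.25, 0.75)],
    [(0.2, 0.0, 0.0), (0.1, 0.0, 0.0)],
]
events = merge_events(decode_events(pred, class_names, bin_duration=1.0))
print(events)  # [('music', 1.25, 1.75), ('speech', 0.5, 2.0)]
```

In this sketch the boundaries come straight from the regressed start and end values, so the only post-processing is a single merge pass over a short event list. Segmentation-by-classification would instead have to threshold and smooth a label for every frame before boundaries could be read off.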

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
