Sensors (Sep 2024)
Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection
Abstract
Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of the spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the differences among acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, most recent studies have modeled time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across the spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. Furthermore, our method incorporates the multi-ACCDOA format to effectively handle overlaps between sound events of the same class (homogeneous overlaps). In our evaluation, we conduct extensive experiments on three de facto benchmark datasets, and the results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.
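To make the E3D idea concrete: the abstract does not specify which parameter-free attention mechanism is used, but SimAM is a well-known example of one. The sketch below (a hypothetical illustration, not the authors' implementation) applies a SimAM-style energy-based reweighting to a 3D feature volume of the kind a 3D convolution would produce, with axes for channels, time, frequency, and a spatial (microphone-channel) dimension; the function name `simam_3d` and the axis layout are assumptions.

```python
import numpy as np

def simam_3d(x, lam=1e-4):
    """SimAM-style parameter-free attention over a 3D feature volume.

    x   : array of shape (C, T, F, S) -- channels, time, frequency, spatial.
    lam : regularization constant from the SimAM energy function.

    Each element is reweighted by a sigmoid of its (normalized) squared
    deviation from the per-channel mean, so salient neurons are emphasized
    without introducing any learnable parameters.
    """
    # Per-channel statistics over the time-frequency-spatial volume.
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = ((x - mu) ** 2).mean(axis=(1, 2, 3), keepdims=True)

    # Inverse of the minimal neuron energy (up to constants): elements far
    # from the channel mean get higher importance scores.
    energy = ((x - mu) ** 2) / (4.0 * (var + lam)) + 0.5

    # Gate the features with sigmoid(energy); no weights are learned.
    return x / (1.0 + np.exp(-energy))
```

Because the attention weights lie in (0, 1), the module can be dropped between convolutions in a residual block without changing tensor shapes, which is what makes parameter-free attention attractive inside a 3D residual backbone.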
Keywords