GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration

Mengzhen Ma; Ying Hu; Liang He; Hao Huang

doi:10.1186/s13636-024-00356-4

EURASIP Journal on Audio, Speech, and Music Processing (Jun 2024)

Affiliations

Mengzhen Ma: School of Computer Science and Technology, Xinjiang University
Ying Hu: School of Computer Science and Technology, Xinjiang University
Liang He: School of Computer Science and Technology, Xinjiang University
Hao Huang: School of Computer Science and Technology, Xinjiang University

DOI: https://doi.org/10.1186/s13636-024-00356-4
Journal volume & issue: Vol. 2024, no. 1, pp. 1–16

Abstract

The polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and estimate their corresponding direction of arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on the convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. Convolutions with a single kernel scale in a CRNN fail to adequately extract the multi-scale features of sound events, which have diverse time-frequency characteristics. As a result, the extracted features lack fine-grained information that is helpful for localizing sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), in which the global-local feature (GLF) extractor is designed to extract multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed to capture detailed information. In addition, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task 3 of the DCASE 2021 and 2022 Challenges, we compare the proposed GLFER-Net with six and four SSLD methods, respectively. The results show that GLFER-Net achieves competitive performance. A series of ablation experiments and visualization analyses verifies that the designed modules are effective.
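
The two ideas named in the abstract, multi-scale feature extraction and feature recalibration, can be illustrated with a toy sketch. This is not the authors' GLFER-Net implementation: the function names, the averaging kernels, the kernel sizes (1, 3, 5), and the fixed sigmoid gate are all assumptions chosen for illustration. In the actual network these components would be learned layers operating on multichannel time-frequency features.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 'same' 1-D correlation of a list with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernel_sizes=(1, 3, 5)):
    """Hypothetical multi-scale step: one channel per kernel size.

    Each channel is the input smoothed by an averaging kernel, so small
    kernels keep short events sharp and large kernels summarize context.
    """
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


def recalibrate(channels):
    """Hypothetical recalibration step: scale each channel by a gate in (0, 1).

    The gate is a sigmoid of the channel mean. This is a fixed stand-in for
    a learned, squeeze-and-excitation-style gate, not the paper's FR module.
    """
    gates = [1.0 / (1.0 + math.exp(-sum(c) / len(c))) for c in channels]
    scaled = [[g * x for x in c] for g, c in zip(gates, channels)]
    return scaled, gates


# Toy input: a short impulsive event followed by a longer sustained one.
frame_energy = [0.0, 0.0, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0]
features = multi_scale_features(frame_energy)
recalibrated, gates = recalibrate(features)
```

With kernel size 1 the impulsive event stays intact, while the larger kernels spread it over neighboring frames. That difference between scales is the kind of variation a single-scale convolution cannot capture.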

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
