Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios

Stijn Kindt; Jenthe Thienpondt; Luca Becker; Nilesh Madhu

doi:10.1186/s13636-023-00310-w

EURASIP Journal on Audio, Speech, and Music Processing (Oct 2023)

Affiliations

Stijn Kindt: IDLab, Department of Electronics and Information Systems, Ghent University - imec
Jenthe Thienpondt: IDLab, Department of Electronics and Information Systems, Ghent University - imec
Luca Becker: Institute of Communication Acoustics, Ruhr-Universität Bochum
Nilesh Madhu: IDLab, Department of Electronics and Information Systems, Ghent University - imec

Journal volume & issue: Vol. 2023, no. 1, pp. 1–20

Abstract

Speaker embeddings from the ECAPA-TDNN speaker verification network were recently introduced as features for the task of clustering microphones in ad hoc arrays. Our previous work demonstrated that, in comparison to signal-based Mod-MFCC features, using speaker embeddings yielded a more robust and logical clustering of the microphones around the sources of interest. This work aims to further establish speaker embeddings as a robust feature for ad hoc microphone clustering by addressing open and additional questions of practical interest arising from our prior work. Specifically, whereas our initial work made use of simulated data based on shoe-box acoustic models, we now present a more thorough analysis in more realistic settings. Furthermore, we investigate additional important considerations: the choice of the distance metric used in the fuzzy C-means clustering; the minimal time range across which data need to be aggregated to obtain robust clusters; and the performance of the features in increasingly challenging situations and with multiple speakers. We also contrast the results on the basis of several metrics for quantifying the quality of such ad hoc clusters. Results indicate that the speaker embeddings are robust to short inference times and deliver logical and useful clusters, even when the sources are very close to each other.
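
To make the role of the distance metric in fuzzy C-means concrete, the following is a minimal sketch, not the authors' implementation. It assumes each microphone is represented by one embedding vector (here hypothetical 2-D toy vectors rather than real ECAPA-TDNN outputs), uses the standard membership update with fuzzifier `m`, and takes the distance function as a parameter. The cosine and Euclidean distances shown are illustrative choices; the abstract does not state which metrics the paper compares. Two simplifications are assumptions of this sketch: centres are initialised from evenly spaced input points, and they are updated as membership-weighted means regardless of the metric.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_c_means(points, n_clusters, distance, m=2.0, n_iter=30):
    """Fuzzy C-means with a pluggable distance (illustrative sketch).

    Returns (memberships, centers); memberships[i][j] is the degree to
    which point i belongs to cluster j, and each row sums to 1.
    """
    n, dim = len(points), len(points[0])
    # Assumption: deterministic initialisation from evenly spaced points.
    centers = [list(points[(j * n) // n_clusters]) for j in range(n_clusters)]
    exponent = -2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        memberships = []
        for p in points:
            inv = [max(distance(p, c), 1e-12) ** exponent for c in centers]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Centre update: mean of the points weighted by u_ij^m.
        for j in range(n_clusters):
            w = [u[j] ** m for u in memberships]
            w_sum = sum(w)
            centers[j] = [
                sum(wi * p[k] for wi, p in zip(w, points)) / w_sum
                for k in range(dim)
            ]
    return memberships, centers


def hard_labels(memberships):
    """Assign each point to the cluster with the highest membership."""
    return [max(range(len(u)), key=u.__getitem__) for u in memberships]


# Hypothetical toy "embeddings": three microphones near each of two sources.
toy_embeddings = [
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
    [0.1, 1.0], [0.0, 0.9], [-0.1, 1.0],
]

memberships, _ = fuzzy_c_means(toy_embeddings, 2, cosine_distance)
labels = hard_labels(memberships)
```

Passing `euclidean_distance` instead of `cosine_distance` changes only how the memberships are weighted, which is one way to compare metrics on the same set of embeddings.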

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
