Speaker Diarization: A Review of Objectives and Methods

Douglas O’Shaughnessy

doi:10.3390/app15042002

Applied Sciences (Feb 2025)

Affiliations

Douglas O’Shaughnessy: INRS-EMT, Montreal, QC H5A 1K6, Canada

DOI: https://doi.org/10.3390/app15042002
Journal volume & issue: Vol. 15, no. 4, p. 2002

Abstract

Recorded audio often contains speech from multiple people in conversation. It is useful to label such signals with speaker turns, noting when each speaker is talking and identifying each speaker. This paper discusses how to process speech signals to perform such speaker diarization (SD). We examine the nature of speech signals to identify the acoustical features that could assist this clustering task. Traditional speech analysis techniques are reviewed, as well as measures of spectral similarity and clustering. Speech activity detection requires separating speech from background noise in general audio signals. SD may use stochastic models (hidden Markov models and Gaussian mixture models) and embeddings such as x-vectors. Modern neural machine learning methods are examined in detail. Suggestions are made for future improvements.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
