Applied Sciences (Feb 2025)

Speaker Diarization: A Review of Objectives and Methods

  • Douglas O’Shaughnessy

DOI
https://doi.org/10.3390/app15042002
Journal volume & issue
Vol. 15, no. 4
p. 2002

Abstract

Recorded audio often contains speech from multiple people in conversation. It is useful to label such signals with speaker turns, noting when each speaker is talking and identifying each speaker. This paper discusses how to process speech signals to perform this task, known as speaker diarization (SD). We examine the nature of speech signals to identify the acoustic features that could assist this clustering task. Traditional speech analysis techniques are reviewed, as well as measures of spectral similarity and clustering. Speech activity detection, a prerequisite step, requires separating speech from background noise in general audio signals. SD may use stochastic models (hidden Markov models and Gaussian mixture models) and embeddings such as x-vectors. Modern neural machine learning methods are examined in detail. Suggestions are made for future improvements.

Keywords