IEEE Access (2023)

Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

  • Yusuke Fujita,
  • Tetsuji Ogawa,
  • Tetsunori Kobayashi

DOI
https://doi.org/10.1109/ACCESS.2023.3340307
Journal volume & issue
Vol. 11
pp. 140069–140076

Abstract

This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method falls under end-to-end neural diarization (EEND), a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With self-conditioning, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module, named the non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
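To make the self-conditioning idea concrete, below is a minimal PyTorch sketch of how intermediate frame-level speaker posteriors can be fed back into a Transformer-based EEND encoder before the next block, so that later blocks condition on the whole sequence of earlier label estimates. This is not the authors' released implementation; the class and attribute names (SelfConditionedEEND, head, condition) are hypothetical, and the project-and-add conditioning step is modeled on the self-conditioned ASR recipe the abstract cites as the mechanism's origin.

    import torch
    import torch.nn as nn

    class SelfConditionedEEND(nn.Module):
        """Sketch of a Transformer-based EEND with self-conditioning."""

        def __init__(self, n_speakers=2, d_model=256, n_heads=4, n_blocks=4):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
                for _ in range(n_blocks))
            # Shared linear head for intermediate and final frame-wise labels.
            self.head = nn.Linear(d_model, n_speakers)
            # Maps intermediate label posteriors back into the feature space.
            self.condition = nn.Linear(n_speakers, d_model)

        def forward(self, x):
            """x: (batch, frames, d_model). Returns per-block posteriors."""
            outputs = []
            for i, block in enumerate(self.blocks):
                x = block(x)
                z = torch.sigmoid(self.head(x))  # frame-wise speaker posteriors
                outputs.append(z)
                if i < len(self.blocks) - 1:
                    # Self-conditioning: embed the intermediate label estimate
                    # and add it to the features, so the next block refines its
                    # prediction given the current label sequence.
                    x = x + self.condition(z)
            # Training would apply the (permutation-invariant) diarization loss
            # to every element of `outputs`, not only the last one.
            return outputs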
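The non-autoregressive attractor is described here only at a high level, so the following is one plausible realization under stated assumptions: learned per-speaker query vectors cross-attend over the frame embeddings and emit all speaker-wise attractors in a single parallel step, in contrast to the autoregressive LSTM decoder used by encoder-decoder-attractor (EEND-EDA) models. All names (NonAutoregressiveAttractor, queries, exist) are hypothetical illustrations, not the paper's module.

    import torch
    import torch.nn as nn

    class NonAutoregressiveAttractor(nn.Module):
        """Sketch: produce all speaker-wise attractors simultaneously."""

        def __init__(self, n_speakers=2, d_model=256, n_heads=4):
            super().__init__()
            # One learnable query per speaker slot; decoded in parallel.
            self.queries = nn.Parameter(torch.randn(n_speakers, d_model))
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                    batch_first=True)
            self.exist = nn.Linear(d_model, 1)  # attractor-existence score

        def forward(self, emb):
            """emb: (batch, frames, d_model) encoder frame embeddings."""
            q = self.queries.unsqueeze(0).expand(emb.size(0), -1, -1)
            # Single non-autoregressive step: (batch, n_speakers, d_model).
            attractors, _ = self.cross_attn(q, emb, emb)
            # Frame-wise posteriors via frame/attractor dot products.
            logits = torch.einsum('btd,bsd->bts', emb, attractors)
            return torch.sigmoid(logits), torch.sigmoid(self.exist(attractors))

Because every attractor is computed in one parallel pass rather than sequentially, a module of this shape avoids the step-by-step decoding that slows autoregressive attractor training, which matches the training-efficiency motivation given in the abstract.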

Keywords