EURASIP Journal on Audio, Speech, and Music Processing (Dec 2024)

Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction

  • Ragini Sinha,
  • Christian Rollwage,
  • Simon Doclo

DOI
https://doi.org/10.1186/s13636-024-00384-0
Journal volume & issue
Vol. 2024, no. 1
pp. 1 – 13

Abstract

Read online

Abstract Speaker-conditioned target speaker extraction aims at estimating the target speaker from a mixture of speakers utilizing auxiliary information about the target speaker. In this paper, we consider a single-channel target speaker extraction system consisting of a speaker embedder network and a speaker separator network. Instead of using standard long short-term memory (LSTM) cells in the separator network, we propose two variants of LSTM cells that are customized for speaker-conditioned target speaker extraction. The first variant customizes both the forget gate and input gate of the LSTM cell, aiming at retaining only relevant features related to target speaker and disregarding the interfering speakers by simultaneously resetting and updating the cell state using the speaker embedding. For the second variant, we introduce a new gate within the LSTM cell, referred to as auxiliary-modulation gate. This gate modulates the information processing during cell state reset, aiming at learning the long-term and short-term discriminative features of the target speaker. Both in unidirectional and bidirectional mode, experimental results on 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures (containing 1, 2, or 3 speakers) show that both proposed variants of LSTM cells outperform the standard LSTM cells for target speaker extraction, where the best performance is obtained using the auxiliary-gated LSTM cells.

Keywords