Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction

Ragini Sinha; Christian Rollwage; Simon Doclo

doi:10.1186/s13636-024-00384-0

EURASIP Journal on Audio, Speech, and Music Processing (Dec 2024)

Affiliations

Ragini Sinha: Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg Branch for Hearing, Speech and Audio Technology HSA
Christian Rollwage: Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg Branch for Hearing, Speech and Audio Technology HSA
Simon Doclo: Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg Branch for Hearing, Speech and Audio Technology HSA

DOI: https://doi.org/10.1186/s13636-024-00384-0
Journal volume & issue: Vol. 2024, no. 1
pp. 1 – 13

Abstract

Speaker-conditioned target speaker extraction aims at estimating the target speaker from a mixture of speakers utilizing auxiliary information about the target speaker. In this paper, we consider a single-channel target speaker extraction system consisting of a speaker embedder network and a speaker separator network. Instead of using standard long short-term memory (LSTM) cells in the separator network, we propose two variants of LSTM cells that are customized for speaker-conditioned target speaker extraction. The first variant customizes both the forget gate and the input gate of the LSTM cell, aiming at retaining only relevant features related to the target speaker and disregarding the interfering speakers by simultaneously resetting and updating the cell state using the speaker embedding. For the second variant, we introduce a new gate within the LSTM cell, referred to as the auxiliary-modulation gate. This gate modulates the information processing during cell state reset, aiming at learning the long-term and short-term discriminative features of the target speaker. In both unidirectional and bidirectional mode, experimental results on 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures (containing 1, 2, or 3 speakers) show that both proposed variants of LSTM cells outperform the standard LSTM cells for target speaker extraction, with the best performance obtained using the auxiliary-gated LSTM cells.

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
