IEEE Access (Jan 2024)

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

  • Peng Chen,
  • Binh Thien Nguyen,
  • Yuting Geng,
  • Kenta Iwai,
  • Takanobu Nishiura

DOI
https://doi.org/10.1109/ACCESS.2024.3479292
Journal volume & issue
Vol. 12
pp. 152036–152044

Abstract


Single-channel speech separation can be adopted in many applications, and time-frequency (T-F) masking is an effective method for performing it. With advancements in deep learning, T-F masks have come to be used as training targets, achieving notable separation results. Among the many masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies with the speech mixture and the separation model. The existing approach mainly uses a single network to approximate the mask of the target speech. However, mixed speech contains segments where speech overlaps with other speech, segments where speech overlaps with silent intervals, and segments with high signal-to-noise ratio (SNR) mixtures arising from pauses and variations in the speakers' intonation and emphasis. In this paper, we attempt to use different networks to handle speech segments containing these different kinds of mixtures. In addition to the existing network, we introduce a network (using the Rectified Linear Unit as its activation function) to specifically address segments containing a mixture of speech and silence, as well as segments with high-SNR speech mixtures. We conducted evaluation experiments on the speech separation of two speakers using the four aforementioned masks as training targets. The performance improvements observed in the evaluation experiments demonstrate the effectiveness of our proposed method based on the joint network compared with the conventional method based on a single network.
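To make the four training targets concrete, the sketch below computes IBM, IRM, WF, and SMM from the magnitude spectrograms of the target and interfering speech, using their standard textbook formulations. This is an illustrative assumption on my part, not code from the paper: the exact definitions used by the authors (e.g., compression exponents, or computing the SMM against the true mixture spectrogram rather than the sum of magnitudes) may differ.

```python
import numpy as np

def masking_targets(target_mag, interf_mag, eps=1e-8):
    """Compute four common T-F masking-based training targets from
    magnitude spectrograms (freq x time) of the target and interfering
    speech. Standard formulations; the paper's exact variants may differ."""
    target_pow = target_mag ** 2
    interf_pow = interf_mag ** 2

    # Ideal binary mask: 1 where the target dominates the interference
    ibm = (target_mag > interf_mag).astype(np.float64)

    # Wiener filter: ratio of target power to total power
    wf = target_pow / (target_pow + interf_pow + eps)

    # Ideal ratio mask: square root of the Wiener filter (energy ratio)
    irm = np.sqrt(wf)

    # Spectral magnitude mask: target magnitude over mixture magnitude.
    # Approximating |mixture| by |target| + |interference| ignores phase;
    # in practice the SMM uses the magnitude of the actual mixture STFT.
    smm = target_mag / (target_mag + interf_mag + eps)

    return ibm, irm, wf, smm
```

Each mask is applied multiplicatively to the mixture spectrogram at inference time; the network is trained to predict the mask (or, for the SMM, a possibly unbounded ratio) from the mixture.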

Keywords