Threshold-Based Combination of Ideal Binary Mask and Ideal Ratio Mask for Single-Channel Speech Separation

Peng Chen; Binh Thien Nguyen; Kenta Iwai; Takanobu Nishiura

doi:10.3390/info15100608

Information (Oct 2024)

Affiliations

Peng Chen: Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan
Binh Thien Nguyen: College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan
Kenta Iwai: College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan
Takanobu Nishiura: College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan

DOI: https://doi.org/10.3390/info15100608
Journal volume & issue: Vol. 15, no. 10, p. 608

Abstract

An effective approach to the speech separation problem is to apply a time–frequency (T-F) mask. The ideal binary mask (IBM) and the ideal ratio mask (IRM) have long been widely used to separate speech signals. However, the IBM is better at improving speech intelligibility, while the IRM is better at improving speech quality. To leverage their respective strengths and overcome their weaknesses, we propose an ideal threshold-based mask (ITM) that combines the two masks. By adjusting two thresholds, the two masks are combined so that they act jointly on speech separation. We report the impact of different threshold combinations on speech separation performance under ideal conditions and discuss a reasonable range for fine-tuning the thresholds. To evaluate the effectiveness of the proposed method, we conducted supervised speech separation experiments in which the masks served as training targets for a deep neural network (DNN) and a long short-term memory (LSTM) network. The results were measured by three objective indicators: the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR) improvements. The experimental results show that the proposed mask combines the strengths of the IBM and IRM, and they suggest that the accuracy of speech separation can potentially be further improved by effectively leveraging the advantages of different masks.
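As a rough illustration of the masks named in the abstract, the sketch below computes an IBM and an IRM from magnitude spectrograms using their commonly cited textbook forms, then combines them with two thresholds. The combination rule shown here (force values at or below a lower threshold to 0, force values at or above an upper threshold to 1, keep the ratio value in between) is only an assumption made for illustration; the abstract does not give the ITM formula, so this should not be read as the authors' definition. The function names, the parameters `lc_db`, `beta`, `low`, and `high`, and their default values are likewise hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero


def ideal_binary_mask(speech_mag, interf_mag, lc_db=0.0):
    """IBM: 1 where the local SNR (in dB) exceeds the criterion lc_db, else 0."""
    snr_db = 20.0 * np.log10((speech_mag + EPS) / (interf_mag + EPS))
    return (snr_db > lc_db).astype(np.float64)


def ideal_ratio_mask(speech_mag, interf_mag, beta=0.5):
    """IRM: (|S|^2 / (|S|^2 + |N|^2)) ** beta, with values in [0, 1]."""
    s2 = speech_mag ** 2
    n2 = interf_mag ** 2
    return (s2 / (s2 + n2 + EPS)) ** beta


def threshold_combined_mask(speech_mag, interf_mag, low=0.3, high=0.7):
    """Hypothetical two-threshold combination (an assumption, not the paper's ITM).

    T-F units with a ratio value <= low are set to 0 and units with a ratio
    value >= high are set to 1 (binary behaviour); the remaining units keep
    their ratio value.
    """
    irm = ideal_ratio_mask(speech_mag, interf_mag)
    mask = irm.copy()
    mask[irm <= low] = 0.0
    mask[irm >= high] = 1.0
    return mask


if __name__ == "__main__":
    # Toy magnitudes for one frame with three frequency bins.
    speech = np.array([[3.0, 1.0, 0.1]])
    interf = np.array([[0.1, 1.0, 3.0]])
    print(ideal_binary_mask(speech, interf))
    print(ideal_ratio_mask(speech, interf))
    print(threshold_combined_mask(speech, interf, low=0.3, high=0.8))
```

Under this assumed rule, setting `low` equal to `high` yields a purely binary mask, while `low=0` and `high=1` leave the IRM essentially unchanged, which is one way to picture how two thresholds could interpolate between the two masks.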

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
