Information (Oct 2024)
Threshold-Based Combination of Ideal Binary Mask and Ideal Ratio Mask for Single-Channel Speech Separation
Abstract
An effective approach to addressing the speech separation problem is utilizing a time–frequency (T-F) mask. The ideal binary mask (IBM) and ideal ratio mask (IRM) have long been widely used to separate speech signals. However, the IBM is better at improving speech intelligibility, while the IRM is better at improving speech quality. To leverage their respective strengths and overcome weaknesses, we propose an ideal threshold-based mask (ITM) to combine these two masks. By adjusting two thresholds, these two masks are combined to jointly act on speech separation. We list the impact of using different threshold combinations on speech separation performance under ideal conditions and discuss a reasonable range for fine tuning the thresholds. By using masks as a training target, to evaluate the effectiveness of the proposed method, we conducted supervised speech separation experiments applying a deep neural network (DNN) and long short-term memory (LSTM), the results of which were measured by three objective indicators: the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio improvement (SAR). Experimental results show that the proposed mask combines the strengths of the IBM and IRM and implies that the accuracy of speech separation can potentially be further improved by effectively leveraging the advantages of different masks.
Keywords