IEEE Access (Jan 2024)
Speech Enhancement Based on a Joint Two-Stage CRN+DNN-DEC Model and a New Constrained Phase-Sensitive Magnitude Ratio Mask
Abstract
In this paper, we propose a jointly optimized, stacked two-stage speech enhancement model. In the first stage, a convolutional recurrent network (CRN)-based masking module is integrated with the signal analysis (fast Fourier transform, FFT) and resynthesis (inverse FFT, IFFT) operations as additional joint layers (FFT-CRN-IFFT). This joint FFT-CRN-IFFT model separates the time-domain (TD) speech and noise signals. In addition, we propose new constrained phase-sensitive magnitude ratio masks (cPSIRMs) for the speech and noise sources, which the CRN estimates at this stage with respect to the final time-domain signals. In the second stage, a deep neural network integrated with the decoder layers of a deep autoencoder (DNN-DEC) further enhances the separated signals and reduces distortions. We also introduce a supervised, multi-objective, step-wise learning approach that gradually maps the input to the main output of the unified two-stage model (CRN+DNN-DEC) through multiple training steps (e.g., the 4-step mapping we ultimately recommend). In this approach, the layers learned at each step serve as pre-training for the next step, and the final step fine-tunes the entire integrated end-to-end model. The unified model thus estimates not only low-level structural features as intermediate targets but also high-level signals as the main targets. Experimental results show that the proposed approaches improve the average perceptual evaluation of speech quality (PESQ) score by up to 0.6 over prior methods.
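For intuition, the sketch below computes a conventional phase-sensitive mask and constrains (clips) it to a bounded range, which is the quantity the proposed cPSIRM builds on; the paper's exact cPSIRM formulation is defined in the body of the text, so the function name, the [0, 1] bound, and the toy spectrogram shapes here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def constrained_phase_sensitive_mask(S, Y, lo=0.0, hi=1.0, eps=1e-8):
    """Illustrative constrained phase-sensitive mask (not the exact cPSIRM):
    (|S| / |Y|) * cos(theta_S - theta_Y), clipped to [lo, hi].

    S, Y : complex STFT matrices of the clean source and the noisy mixture.
    """
    mask = (np.abs(S) / (np.abs(Y) + eps)) * np.cos(np.angle(S) - np.angle(Y))
    return np.clip(mask, lo, hi)

# Toy usage: random complex "spectrograms" standing in for FFT-layer outputs.
rng = np.random.default_rng(0)
S = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
N = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
Y = S + N                                   # noisy mixture
M = constrained_phase_sensitive_mask(S, Y)  # bounded real-valued mask
S_hat = M * Y                               # masked estimate, fed to the IFFT/resynthesis layer
```

Clipping keeps the training target bounded, which is the usual motivation for constraining phase-sensitive masks whose raw values can exceed 1 or go negative when source and mixture phases disagree.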
Keywords