IEEE Access (Jan 2023)
Multi-Channel Bin-Wise Speech Separation Combining Time-Frequency Masking and Beamforming
Abstract
This paper presents a novel Blind Source Separation method that can handle convolutive mixtures, including underdetermined ones. Our method combines Time-Frequency (TF) masking and beamforming and exploits the sparsity of the source signals in the TF domain. TF masking-based methods can achieve remarkable performance, even in the underdetermined case, but they tend to introduce unwanted artifacts into the separated signals. Beamforming techniques, by contrast, do not distort the estimated signals but achieve satisfactory performance only in the overdetermined and determined cases. By combining these two approaches, we can leverage their respective strengths. First, we exploit the sparsity of the source signals in the TF domain to estimate probabilistic “bin-wise” masks by modeling the frequency observation vectors with a complex Gaussian Mixture Model fitted with an Expectation-Maximization (EM) algorithm. Because the EM algorithm is sensitive to initialization, we propose selecting the initial values of the model parameters using Hermitian angles between the frequency observation vectors and a reference vector. Then, we use the estimated TF masks to estimate the Relative Transfer Functions of each source. Finally, we propose a new technique for estimating the spatial images of the separated sources, which can be regarded as an underdetermined extension of the Linearly Constrained Minimum Power beamformer. In our tests, the method performed well in both the determined and underdetermined cases compared with various existing methods that rely on similar working hypotheses.
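As an illustration of the quantity used in the initialization step, the following minimal sketch computes the Hermitian angle between one complex observation vector and a reference vector, using the standard definition cos(theta) = |r^H x| / (||r|| ||x||). The function name `hermitian_angle` is hypothetical. The abstract does not say how the reference vector is chosen or how the angles are turned into initial parameter values, so neither is modeled here.

```python
import math


def hermitian_angle(x, ref):
    """Hermitian angle in radians, in [0, pi/2], between two complex vectors.

    Illustrative helper (assumption, not taken from the paper): x stands for
    one frequency observation vector, ref for the reference vector.
    """
    if len(x) != len(ref):
        raise ValueError("vectors must have the same length")
    # Inner product ref^H x: conjugate the entries of the reference vector.
    inner = sum(r.conjugate() * xi for r, xi in zip(ref, x))
    norm_x = math.sqrt(sum(abs(xi) ** 2 for xi in x))
    norm_ref = math.sqrt(sum(abs(r) ** 2 for r in ref))
    if norm_x == 0.0 or norm_ref == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to 1.0 so rounding error cannot push acos outside its domain.
    cos_theta = min(1.0, abs(inner) / (norm_x * norm_ref))
    return math.acos(cos_theta)
```

Because only the magnitude of the inner product is used, the angle is unchanged when either vector is multiplied by a complex scalar of unit modulus, so a common phase shift across channels does not affect it. Orthogonal vectors give an angle of pi/2, and collinear vectors give 0.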
Keywords