IEEE Access (Jan 2024)
Joint Far- and Near-End Speech and Listening Enhancement With Minimum Processing
Abstract
This paper considers speech and listening enhancement for signals captured in one noisy environment that must be played back to a listener in another noisy environment. In both far-end speech enhancement and near-end listening enhancement, overly prioritizing noise suppression or maximizing intelligibility can result in undue speech distortions and reduced quality, especially when intelligibility is already high in favorable noise conditions. To address this, the use of a minimum processing framework has been proposed with the aim of reducing noise or enhancing listening to a minimum degree while ensuring that a specified intelligibility level is maintained. Furthermore, results have shown that jointly considering both environments improves performance compared to blindly concatenating far- and near-end methods. In blind processing, near-end listening enhancement typically assumes that the far-end signal is devoid of noise, potentially leading to noise being erroneously interpreted as speech. Additionally, if the transmitter and receiver are blind to each other’s presence, multiple instances of far- and near-end enhancement may occur and possibly work in opposite directions, thus degrading the enhancement performance. In this paper, we perform a comprehensive exploration of our previously proposed joint far- and near-end minimum processing framework with systematic analysis and discussion. We derive a closed-form solution to the joint far- and near-end minimum processing optimization problem, with a mean-square error processing penalty, a speech intelligibility constraint based on the approximated speech intelligibility index, and a noise power constraint. Performance was systematically studied using objective measures and listening tests for intelligibility, listening effort, and quality. We compared against relevant joint and blind methods with minimum and maximum processing.
The results suggest that minimum processing achieves intelligibility comparable to maximum processing while preserving quality at higher signal-to-noise ratios, indicating its benefits in end-to-end communication. Joint processing provides advantages in objectively estimated speech intelligibility for the minimum processing case, but not for maximum processing. However, no significant differences were observed in the listening test results. This suggests that in certain speech and listening scenarios, it is feasible to optimize near- and far-end aspects separately, offering a more practical and convenient approach compared to joint optimization.
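As a rough illustration of the intelligibility constraint mentioned above, the following minimal sketch computes an approximated speech intelligibility index in its commonly used form: a weighted sum over frequency bands of SNR / (SNR + 1). This is not the paper's implementation. The function names `asii` and `meets_target`, the per-band power inputs, and the band-importance weights are assumptions made for illustration; a real system would take the weights from a band-importance table.

```python
def asii(speech_power, noise_power, band_importance):
    """Approximated speech intelligibility index from per-band powers.

    Each band contributes w * snr / (snr + 1), written here as
    w * s / (s + n) so that a noise-free band does not divide by zero.
    Weights are normalized so the result lies in [0, 1].
    """
    total_weight = sum(band_importance)
    score = 0.0
    for s, n, w in zip(speech_power, noise_power, band_importance):
        score += (w / total_weight) * s / (s + n)
    return score


def meets_target(speech_power, noise_power, band_importance, target):
    """Hypothetical helper: True if the index already reaches the target,
    in which case a minimum processing scheme would leave the signal alone."""
    return asii(speech_power, noise_power, band_importance) >= target


# Illustrative values only: two bands with equal importance.
speech = [4.0, 1.0]
noise = [1.0, 1.0]
weights = [0.5, 0.5]
print(asii(speech, noise, weights))
print(meets_target(speech, noise, weights, target=0.6))
```

In a minimum processing setting, a check like `meets_target` captures the basic idea that processing is applied only to the extent needed to reach the specified intelligibility level.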
Keywords