IEEE Access (Jan 2024)

Novel TransQT Neural Network: A Deep Learning Framework for Acoustic Echo Cancellation in Noisy Double-Talk Scenario

  • V. Soni Ishwarya,
  • Mohanaprasad Kothandaraman

DOI
https://doi.org/10.1109/ACCESS.2024.3445279
Journal volume & issue
Vol. 12
pp. 114735 – 114744

Abstract


Acoustic echo is a persistent problem in telecommunication: it degrades speech quality and can disrupt communication entirely or for a period of time, which motivated the development of acoustic echo cancellation (AEC) systems. Demand for AEC has risen significantly since the 2020 global pandemic, as speakers and listeners now communicate in unpredictable settings such as home environments, where echo and noise significantly disrupt communication. Numerous AEC solutions have been proposed, including adaptive filters and deep learning techniques. However, their effectiveness drops notably in double-talk scenarios, where the near-end and far-end speakers talk simultaneously, and in noisy environments. This paper proposes a novel transQT neural network (TNN), an end-to-end neural network that leverages the constant Q transform (CQT) and a transformer-inspired self-attention module to eliminate echo and noise in noisy double-talk scenarios. It also uses the smooth L1 loss function to enable efficient training and enhance the overall performance of the proposed model. In the proposed TNN, the CQT serves as the front end, converting the signal from the time domain to the time-frequency domain. The CQT is chosen primarily to improve speech quality: its logarithmic frequency scale aligns more closely with the human auditory system. The attention module is incorporated among the layers of the proposed model to focus on the double-talk and noisy parts of speech, making it easier to separate the clean target signal from the parts affected by double-talk and noise. The smooth L1 loss is employed to ensure smooth training and stable, efficient convergence. It is also less sensitive to variability in the data, which reduces large errors and the overall loss. Experiments were conducted for both causal and non-causal scenarios. The proposed TNN model demonstrated superior performance in speech quality, as measured by the perceptual evaluation of speech quality (PESQ), and showed a significant reduction of echo, as quantified by echo return loss enhancement (ERLE). Performance was further evaluated using the correlation coefficient, which indicates the relationship between the clean and the echo signal.
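
As a rough illustration of three quantities named in the abstract, the sketch below computes a smooth L1 loss, ERLE in decibels, and a Pearson correlation coefficient in plain Python. The abstract does not give the authors' exact formulas, so these are the common textbook definitions. The function names `smooth_l1`, `erle_db`, and `pearson`, the threshold `beta`, and the stabilizer `eps` are assumptions made for this example and are not taken from the paper.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for |error| < beta, linear otherwise.

    `beta` is an assumed transition point, not a value from the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta   # small errors: L2-like region
        else:
            total += d - 0.5 * beta       # large errors: L1-like region
    return total / len(pred)


def erle_db(mic, residual, eps=1e-12):
    """ERLE in dB: microphone-signal power over residual-signal power.

    Common definition; `eps` only guards against division by zero.
    """
    p_mic = sum(v * v for v in mic) / len(mic)
    p_res = sum(v * v for v in residual) / len(residual)
    return 10.0 * math.log10((p_mic + eps) / (p_res + eps))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In this form, the loss grows only linearly for errors larger than `beta`, which is one way to read the abstract's claim that smooth L1 is less sensitive to variability in the data. A higher ERLE means more echo power was removed between the microphone signal and the model output.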

Keywords