IEEE Access (Jan 2023)

Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement

  • Nasir Saleem,
  • Teddy Surya Gunawan,
  • Muhammad Shafi,
  • Sami Bourouis,
  • Aymen Trigui

DOI
https://doi.org/10.1109/ACCESS.2023.3324210
Journal volume & issue
Vol. 11
pp. 114172–114186

Abstract

Convolutional encoder-decoder (CED) networks have emerged as a powerful architecture, particularly in speech enhancement (SE), which aims to improve the intelligibility and quality of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. CED architectures usually use a gated recurrent unit (GRU) or long short-term memory (LSTM) network as a bottleneck to capture temporal dependencies, enabling an SE model to effectively learn the dynamics and long-term temporal dependencies of the speech signal. However, Transformer neural networks with self-attention capture long-term temporal dependencies more effectively. This study proposes a multi-attention bottleneck (MAB) comprising a self-attention Transformer augmented with a time-frequency attention (TFA) module followed by a channel attention module (CAM) to focus on important features. The proposed bottleneck is integrated into a CED architecture, and the resulting model is named MAB-CED. MAB-CED uses an encoder-decoder structure with a shared encoder and two decoders, one dedicated to spectral masking and the other to spectral mapping. Convolutional Gated Linear Units (ConvGLU) and Deconvolutional Gated Linear Units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are coupled by coherent averaging to synthesize the enhanced speech signal. The proposed model is evaluated on two databases, VoiceBank+DEMAND and LibriSpeech. The results show that it outperforms the benchmarks in terms of intelligibility and quality at various input SNRs: MAB-CED improves the average PESQ by 0.55 (22.85%) on VoiceBank+DEMAND and by 0.58 (23.79%) on LibriSpeech, and the average STOI by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.
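For a concrete picture of the pipeline, the following is a minimal PyTorch sketch of the MAB-CED layout as described in the abstract. All layer counts, kernel shapes, and the internals of the TFA and CAM modules are illustrative assumptions rather than the authors' published configuration; the sketch keeps the time-frequency shape fixed for clarity (the published model presumably down- and upsamples), and a simple average of the two decoder outputs stands in for the paper's coherent averaging.

```python
# Illustrative sketch only: layer sizes, kernels, and TFA/CAM internals are
# assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn


class ConvGLU(nn.Module):
    """Gated convolution: conv(x) * sigmoid(gate(x))."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.gate = nn.Conv2d(cin, cout, 3, padding=1)

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))


class DeconvGLU(nn.Module):
    """Gated transposed convolution for the decoders."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.ConvTranspose2d(cin, cout, 3, padding=1)
        self.gate = nn.ConvTranspose2d(cin, cout, 3, padding=1)

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed CAM form)."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, T, F)
        w = self.fc(x.mean(dim=(2, 3)))          # global pool over time/freq
        return x * w[:, :, None, None]


class MAB(nn.Module):
    """Multi-attention bottleneck: self-attention, then TFA, then CAM."""
    def __init__(self, ch, freq):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model=ch * freq, nhead=4,
                                               batch_first=True)
        self.tfa_t = nn.Conv2d(ch, ch, 1)        # stand-ins for the TFA module:
        self.tfa_f = nn.Conv2d(ch, ch, 1)        # per-axis attention weights
        self.cam = ChannelAttention(ch)

    def forward(self, x):                        # x: (B, C, T, F)
        b, c, t, f = x.shape
        y = self.attn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        y = y.reshape(b, t, c, f).permute(0, 2, 1, 3)
        ta = torch.sigmoid(self.tfa_t(y).mean(dim=3, keepdim=True))  # time
        fa = torch.sigmoid(self.tfa_f(y).mean(dim=2, keepdim=True))  # freq
        return self.cam(y * ta * fa)


class MABCED(nn.Module):
    """Shared gated encoder, MAB bottleneck, masking + mapping decoders."""
    def __init__(self, freq=161):
        super().__init__()
        self.enc = nn.Sequential(ConvGLU(1, 16), ConvGLU(16, 32))
        self.mab = MAB(32, freq)
        self.dec_mask = nn.Sequential(DeconvGLU(32, 16), DeconvGLU(16, 1),
                                      nn.Sigmoid())
        self.dec_map = nn.Sequential(DeconvGLU(32, 16), DeconvGLU(16, 1))

    def forward(self, spec):                     # spec: (B, 1, T, F) magnitude
        z = self.mab(self.enc(spec))
        masked = self.dec_mask(z) * spec         # spectral-masking decoder
        mapped = self.dec_map(z)                 # spectral-mapping decoder
        return 0.5 * (masked + mapped)           # stand-in for coherent averaging
```

As a shape check, MABCED(freq=161)(torch.rand(2, 1, 100, 161)) returns an enhanced magnitude spectrogram of shape (2, 1, 100, 161); synthesizing the time-domain waveform from the enhanced spectrum (e.g., with the noisy phase) is outside this sketch.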

Keywords