IEEE Access (Jan 2024)

Spatio-Temporal Features Representation Using Recurrent Capsules for Monaural Speech Enhancement

  • Jawad Ali
  • Nasir Saleem
  • Sami Bourouis
  • Eatedal Alabdulkreem
  • Hela El Mannai
  • Sami Dhahbi

DOI
https://doi.org/10.1109/ACCESS.2024.3361286
Journal volume & issue
Vol. 12
pp. 21287 – 21303

Abstract

Single-channel speech enhancement is important for modern communication systems and has received considerable attention. A convolutional neural network (CNN) successfully learns feature representations from speech spectrograms but, due to distortion, loses spatial information that is important for human speech understanding. Speech feature learning is an important and ongoing research area that aims to capture higher-level representations of speech beyond those of conventional techniques. By considering the hierarchical structure and temporal relationships within speech signals, capsule networks (CapsNets) have the potential to provide more expressive and context-aware feature representations. Building on the advantages of CapsNets over CNNs, this study presents a model for monaural speech enhancement that keeps spatial information in capsules and uses dynamic routing to pass it to higher layers. Dynamic routing replaces the pooling of recurrent hidden states to obtain speech features from the capsule outputs. Leveraging long-term context helps identify the target speaker. Therefore, a gated recurrent layer, either a gated recurrent unit (GRU) or long short-term memory (LSTM), is placed above the CNN module and next to the capsule module in the architecture. This makes it possible to extract both spatial features and long-term temporal dynamics. The proposed convolutional recurrent CapsNet outperforms models based on CNNs and recurrent neural networks and produces considerably better speech quality and intelligibility. On the LibriSpeech and VoiceBank+DEMAND databases, the proposed method improves intelligibility by 18.33% and quality by 36.82% (0.94) over the noisy mixtures.
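
For readers unfamiliar with dynamic routing, the following is a minimal sketch of the generic routing-by-agreement procedure commonly used in capsule networks. It is not the authors' implementation: the function names (`squash`, `softmax`, `dynamic_routing`), the three routing iterations, and the nested-list layout of `u_hat` are assumptions chosen for illustration, and the CNN and GRU/LSTM stages that would produce the prediction vectors are omitted.

```python
import math

def squash(s):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(row):
    """Numerically stable softmax over a list of routing logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Route lower-level capsule predictions to upper-level capsules.

    u_hat[i][j] is the (hypothetical) prediction vector that lower capsule i
    makes for upper capsule j. Returns the upper capsule outputs v and the
    coupling coefficients c used in the final iteration.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits start at zero
    for _ in range(num_iters):
        # Each lower capsule distributes its output across the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            # Weighted sum of predictions for upper capsule j, then squash.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the capsule output.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Toy input: both lower capsules agree on upper capsule 0 and disagree on 1.
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, -1.0]]]
v, c = dynamic_routing(u_hat)
```

In this toy input the two predictions for upper capsule 0 coincide while those for upper capsule 1 cancel, so the coupling coefficients shift toward capsule 0 over the iterations. This agreement-driven weighting is what takes the place of a fixed pooling operation.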

Keywords