EURASIP Journal on Audio, Speech, and Music Processing (Feb 2024)

Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement

  • Sivaramakrishna Yecchuri,
  • Sunny Dayal Vanambathina

DOI
https://doi.org/10.1186/s13636-024-00331-z
Journal volume & issue
Vol. 2024, no. 1
pp. 1–15

Abstract

Recent deep learning-based speech enhancement models have made extensive use of attention mechanisms, demonstrating their effectiveness in achieving state-of-the-art results. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs or temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, it is composed of several adaptive time-frequency attention modules and an adaptive hierarchical attention module, which together capture long-term time-frequency dependencies and aggregate hierarchical contextual information. Additionally, the sub-convolutional encoder-decoder uses different kernel sizes to extract multi-scale local and contextual features from the noisy speech. Experimental results show that the proposed model outperforms several state-of-the-art methods.
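
The abstract only outlines the architecture, so the following PyTorch sketch illustrates just the multi-scale idea behind the sub-convolutional encoder: parallel convolution branches with different kernel sizes applied to the same noisy-spectrogram features, then concatenated along the channel axis. The class name SubConvBlock, the kernel sizes (1, 3, 5), and the channel counts are assumptions chosen for illustration, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class SubConvBlock(nn.Module):
    """Multi-scale convolutional block (hypothetical sketch): parallel
    branches with different kernel sizes extract local and contextual
    features, which are concatenated along the channel axis."""

    def __init__(self, in_ch, out_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        assert out_ch % len(kernel_sizes) == 0
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding=k // 2 keeps the time-frequency dims for odd k
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.PReLU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # x: (batch, channels, time frames, frequency bins)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

if __name__ == "__main__":
    # One encoder stage on a batch of noisy magnitude spectrograms.
    x = torch.randn(2, 1, 100, 257)   # (batch, 1, frames, freq bins)
    block = SubConvBlock(in_ch=1, out_ch=48)
    print(block(x).shape)             # torch.Size([2, 48, 100, 257])
```

In a U-Net arrangement, several such blocks would be stacked with downsampling in the encoder and mirrored in the decoder, with the paper's transformer attention network operating on the bottleneck features between them.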

Keywords