EURASIP Journal on Audio, Speech, and Music Processing (May 2023)

Time-domain adaptive attention network for single-channel speech separation

  • Kunpeng Wang,
  • Hao Zhou,
  • Jingxiang Cai,
  • Wenna Li,
  • Juan Yao

DOI
https://doi.org/10.1186/s13636-023-00283-w
Journal volume & issue
Vol. 2023, no. 1
pp. 1–15

Abstract


Recent years have witnessed great progress in single-channel speech separation through self-attention-based networks. Despite their excellent ability to mine relevant long-sequence contextual information, self-attention networks cannot fully attend to subtle details in speech signals, such as temporal or spectral continuity, spectral structure, and timbre. To tackle this problem, we propose a time-domain adaptive attention network (TAANet) with local and global attention networks. Channel and spatial attention are introduced in the local attention networks to focus on subtle details of the speech signals (frame-level features). In the global attention networks, a self-attention mechanism is used to explore the global associations of the speech contexts (utterance-level features). Moreover, we model the speech signal serially using multiple local and global attention blocks. Compared with other speech separation feature extraction methods, this cascade structure enables our model to focus on local and global features adaptively, further boosting separation performance. Extensive experiments on benchmark datasets demonstrate that our approach outperforms other end-to-end speech separation methods, achieving 20.7 dB SI-SNRi and 20.9 dB SDRi on WSJ0-2mix.
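To make the cascade structure concrete, the following is a minimal PyTorch sketch of the local/global attention arrangement the abstract describes. It is not the authors' implementation: the module names, feature dimensions, the squeeze-style channel attention, the convolutional spatial attention, and the residual cascade are all illustrative assumptions layered on the abstract's high-level description.

```python
# Illustrative sketch (not the authors' code) of cascaded local and global
# attention blocks. All dimensions and attention formulations are assumed.
import torch
import torch.nn as nn


class LocalAttentionBlock(nn.Module):
    """Channel and spatial attention over frame-level features (assumed form)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: pool over time, then re-weight channels.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Spatial (temporal) attention: re-weight individual frames.
        self.spatial_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        ca = self.channel_fc(x.mean(dim=-1)).unsqueeze(-1)  # (B, C, 1)
        x = x * ca
        sa = self.spatial_conv(x)  # (B, 1, T)
        return x * sa


class GlobalAttentionBlock(nn.Module):
    """Self-attention over the whole utterance (assumed multi-head form)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, time, channels) for attention
        h = x.transpose(1, 2)
        out, _ = self.attn(h, h, h)
        h = self.norm(h + out)
        return h.transpose(1, 2)


class CascadedAttention(nn.Module):
    """Serial stack of local and global blocks, as the abstract outlines."""

    def __init__(self, channels: int = 64, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(LocalAttentionBlock(channels), GlobalAttentionBlock(channels))
            for _ in range(num_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + block(x)  # residual connection around each local+global pair
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 64, 200)  # (batch, channels, frames)
    print(CascadedAttention()(feats).shape)  # torch.Size([2, 64, 200])
```

Each block first sharpens frame-level detail (channel and spatial weighting), then lets self-attention relate frames across the utterance; stacking several such pairs is one plausible reading of the serial modeling the abstract reports.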

Keywords