Leveraging Sparse Approximation for Monaural Overlapped Speech Separation From Auditory Perspective

Hiroshi Sekiguchi; Yoshiaki Narusue; Hiroyuki Morikawa

doi:10.1109/ACCESS.2023.3330645

IEEE Access (Jan 2023)

Affiliations

Hiroshi Sekiguchi: Graduate School of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan
Yoshiaki Narusue: Graduate School of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan
Hiroyuki Morikawa: Graduate School of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan

Journal volume & issue: Vol. 11, pp. 124748–124759

Abstract

Neuroscience suggests that the sparse behavior of a neural population underlies the mechanisms of the auditory system for monaural overlapped speech separation. This study investigates leveraging sparse approximation to improve speech separation in a conventional deep learning algorithm. We develop a combined model that embeds a sparse approximation algorithm, a multilayered iterative soft thresholding algorithm (ML-ISTA), into a conventional time-domain-based speech separation algorithm, Conv-TasNet. Adopting ML-ISTA is a crucial enabler for the embedding process and helps avoid solving a bi-level optimization problem comprising sparse approximation and speech separation. ML-ISTA performs sparse approximation through forward calculations, thereby eliminating the optimization of sparse approximation. The combined model is trained with WSJ0-2mix, the Wall Street Journal English corpus for two-speaker mixed speech without noisy or reverberant interference, to clarify the proposed method’s performance. The model demonstrates that sparse approximation improves separation performance regardless of the approximation setting. The peak performance of the model exceeds that of Conv-TasNet by 1.1% to 4.7% in four speech quality criteria. Moreover, sparse approximation accelerates the combined model performance gain at the early stages of learning relative to Conv-TasNet. The primary novelty of the study is embedding the sparse approximation algorithm, ML-ISTA, into a deep-learning-based speech separation framework and the experimental proof of improved separation performance in the proposed algorithm.
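
The abstract notes that ML-ISTA performs sparse approximation through forward calculations alone. As a rough illustration of that idea, the sketch below runs a fixed number of plain, single-layer ISTA iterations, so the sparse code comes out of a feed-forward pass with no inner solver. It is not the authors' multilayer algorithm or their Conv-TasNet integration. The function names, the list-of-rows dictionary layout, and the default values for `lam`, `step`, and `n_iters` are all assumptions made for this example.

```python
def soft_threshold(v, tau):
    """Elementwise soft thresholding: shrink each entry toward zero by tau."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def ista_forward(dictionary, signal, lam=0.1, step=0.5, n_iters=50):
    """Approximate argmin_z 0.5 * ||signal - D z||^2 + lam * ||z||_1.

    Hypothetical helper: runs a fixed number of ISTA iterations, so the
    whole computation is a forward pass. `step` is assumed to be at most
    1 / (largest eigenvalue of D^T D) for the iteration to converge.
    """
    d_t = transpose(dictionary)
    z = [0.0] * len(dictionary[0])
    for _ in range(n_iters):
        # Gradient step on the reconstruction term.
        residual = [s - r for s, r in zip(signal, matvec(dictionary, z))]
        grad_step = [zi + step * g for zi, g in zip(z, matvec(d_t, residual))]
        # Proximal step for the L1 penalty.
        z = soft_threshold(grad_step, step * lam)
    return z


# Usage: an overcomplete 2x3 dictionary (illustrative values only).
D = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
code = ista_forward(D, [1.0, 0.05], lam=0.1, step=0.5, n_iters=100)
```

Because the iteration count is fixed, a loop of this kind can be unrolled into network layers and trained end to end, which is the general property that allows a sparse approximation step to sit inside a larger separation model without a separate inner optimization.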

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
