Spectral Flux-Based Convolutional Neural Network Architecture for Speech Source Localization and its Real-Time Implementation

Yiya Hao; Abdullah Kucuk; Anshuman Ganguly; Issa M. S. Panahi

doi:10.1109/ACCESS.2020.3033533

IEEE Access (Jan 2020)

Affiliations

Yiya Hao: Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Abdullah Kucuk: Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Anshuman Ganguly: Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Issa M. S. Panahi: Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA

DOI: https://doi.org/10.1109/ACCESS.2020.3033533
Journal volume & issue: Vol. 8, pp. 197047–197058

Abstract

In this article, we present a real-time convolutional neural network (CNN)-based speech source localization (SSL) algorithm that is robust to realistic background acoustic conditions (noise and reverberation). We implemented and tested the proposed method on a prototype platform (Raspberry Pi) for real-time operation. The input feature combines the real and imaginary coefficients of the short-time Fourier transform (STFT) with the Spectral Flux (SF) computed using delay-and-sum (DAS) beamforming. We trained the CNN model on noisy speech recordings collected from different rooms and ran inference on an unseen room. We provide a quantitative comparison with five previously published SSL algorithms under several realistic noisy conditions, and show significant improvements from incorporating SF with beamforming as an additional feature that captures temporal variation in speech spectra. Real-time inference of our CNN model on the prototype platform achieves low latency (21 milliseconds (ms) per frame with a frame length of 30 ms) and high accuracy (for example, 89.68% under babble noise at 5 dB SNR). Lastly, we provide a detailed explanation of the real-time implementation and on-device performance (including peak power consumption metrics), which sets this work apart from previously published works. This work has several notable implications for improving audio-processing algorithms for portable, battery-operated smart loudspeakers and hearing improvement (HI) devices.
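As a rough illustration of the input feature described in the abstract, the following minimal NumPy sketch computes STFT real and imaginary coefficients per microphone and a spectral flux track from a delay-and-sum output. It is not the authors' implementation. The 16 kHz sample rate, 50% hop, Hann window, integer sample delays, the L2 form of spectral flux, and all function names are assumptions made for illustration; only the 30 ms frame length comes from the abstract.

```python
import numpy as np

FS = 16000                    # assumed sample rate (not stated in the abstract)
FRAME_LEN = int(0.030 * FS)   # 30 ms frame, as in the abstract -> 480 samples
HOP = FRAME_LEN // 2          # assumed 50% overlap


def stft_frames(x, frame_len=FRAME_LEN, hop=HOP):
    """Complex STFT of a 1-D signal, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def spectral_flux(spec):
    """L2 norm of the frame-to-frame change in magnitude spectrum.

    This is one common definition of spectral flux; the first frame has
    no predecessor, so its flux is set to 0.
    """
    diff = np.diff(np.abs(spec), axis=0)
    flux = np.sqrt(np.sum(diff ** 2, axis=1))
    return np.concatenate([[0.0], flux])


def delay_and_sum(channels, delays):
    """Average the channels after integer-sample alignment.

    np.roll wraps samples around the ends, which is a simplification
    acceptable only for a short illustrative example.
    """
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)


def build_features(channels, delays):
    """Return (real_imag, flux).

    real_imag: (n_mics, 2, n_frames, n_bins) real and imaginary STFT parts
    flux:      (n_frames,) spectral flux of the beamformed signal
    """
    specs = np.stack([stft_frames(ch) for ch in channels])
    real_imag = np.stack([specs.real, specs.imag], axis=1)
    flux = spectral_flux(stft_frames(delay_and_sum(channels, delays)))
    return real_imag, flux
```

For a two-microphone recording stored as an array of shape (2, n_samples), `build_features(channels, delays=[0, 3])` would return the per-microphone real and imaginary STFT tensor together with one flux value per frame; how these are arranged as CNN input is not specified in the abstract.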

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
