IEEE Access (Jan 2020)
Spectral Flux-Based Convolutional Neural Network Architecture for Speech Source Localization and its Real-Time Implementation
Abstract
In this article, we present a real-time convolutional neural network (CNN)-based speech source localization (SSL) algorithm that is robust to realistic background acoustic conditions (noise and reverberation). We implemented and tested the proposed method on a prototype platform (Raspberry Pi) for real-time operation. We use a combination of the real and imaginary coefficients of the short-time Fourier transform (STFT) and spectral flux (SF) with delay-and-sum (DAS) beamforming as the input feature. We trained the CNN model on noisy speech recordings collected from different rooms and evaluated it on an unseen room. We provide a quantitative comparison with five previously published SSL algorithms under several realistic noisy conditions and show significant improvements from incorporating spectral flux with beamforming as an additional feature for learning the temporal variation in speech spectra. We perform real-time inference with our CNN model on the prototype platform with low latency (21 milliseconds (ms) per frame with a frame length of 30 ms) and high accuracy (e.g., 89.68% under babble noise at 5 dB SNR). Lastly, we provide a detailed account of the real-time implementation and on-device performance (including peak power consumption metrics), which sets this work apart from previously published works. This work has several notable implications for improving audio-processing algorithms in portable, battery-operated smart loudspeakers and hearing improvement (HI) devices.
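As a rough illustration of the spectral flux (SF) feature referred to above, the following minimal Python sketch computes SF from single-channel STFT magnitudes. The frame length, hop, FFT size, and 16 kHz sampling rate are assumed values for illustration only; this is not the authors' multichannel, DAS-beamformed pipeline.

```python
# Illustrative sketch (assumed parameters, not the paper's exact pipeline):
# spectral flux computed from the STFT of a single-channel signal.
import numpy as np

def stft(x, frame_len=480, hop=240, n_fft=512):
    """Framed FFT with a Hann window; returns complex spectra (frames x bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, n=n_fft, axis=1)

def spectral_flux(spectra):
    """Spectral flux: positive change in the magnitude spectrum between frames."""
    mag = np.abs(spectra)
    diff = np.diff(mag, axis=0)
    return np.sum(np.maximum(diff, 0.0), axis=1)  # one SF value per frame step

if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(fs)                 # 1 s of noise as a stand-in for speech
    S = stft(x)                             # complex STFT of the signal
    feat_ri = np.stack([S.real, S.imag])    # real/imaginary input channels
    sf = spectral_flux(S)                   # SF captures temporal spectral variation
    print(feat_ri.shape, sf.shape)
```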
Keywords