EURASIP Journal on Audio, Speech, and Music Processing (May 2018)

Learning long-term filter banks for audio source separation and audio scene classification

  • Teng Zhang,
  • Ji Wu

DOI
https://doi.org/10.1186/s13636-018-0127-7
Journal volume & issue
Vol. 2018, no. 1
pp. 1 – 13

Abstract

Filter banks applied to short-time Fourier transform (STFT) spectrograms have long been studied for analyzing and processing audio. The frame shift in the STFT procedure determines the temporal resolution. However, many discriminative audio applications need long-term time and frequency correlations. In this work, we use filter banks motivated by Toeplitz matrices to extract long-term time and frequency information. This paper investigates the mechanism of long-term filter banks and the corresponding spectrogram reconstruction method. The time duration and shape of the filter banks are designed and learned using neural networks. We test our approach on two tasks. Compared with traditional frequency filter banks, the spectrogram reconstruction error in the audio source separation task is reduced by a relative 6.7%, and the classification error in the audio scene classification task is reduced by a relative 6.5%. The experiments also show that the time duration of the long-term filter banks is much larger in the classification task than in the reconstruction task.

Keywords