Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

Shahin Amiriparian; Maurice Gerczuk; Sandra Ottl; Lukas Stappen; Alice Baird; Lukas Koebe; Björn Schuller

doi:10.1186/s13636-020-00186-0

EURASIP Journal on Audio, Speech, and Music Processing (Dec 2020)

Affiliations

Shahin Amiriparian: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Maurice Gerczuk: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Sandra Ottl: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Lukas Stappen: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Alice Baird: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Lukas Koebe: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg
Björn Schuller: Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg

DOI: https://doi.org/10.1186/s13636-020-00186-0
Journal volume & issue: Vol. 2020, no. 1
pp. 1 – 11

Abstract

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is directly trained on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models (the class probabilities predicted by the CRNN and the decision function of the SVM) are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, ESC-50 and the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that using our system, it is possible to achieve competitive performance on all datasets and demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.
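
As an illustration of the decision-level fusion mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not state the fusion rule, so the softmax normalisation of the SVM scores, the weighted average, the function names and the toy numbers are all assumptions made for this example. The SVM scores are assumed to be per-class decision values such as those returned by scikit-learn's `LinearSVC.decision_function`.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; used here (as an assumption) to map SVM
    decision values onto the same 0..1 scale as the CRNN output."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fuse_decisions(crnn_probs, svm_scores, crnn_weight=0.5):
    """Hypothetical late fusion: weighted average of CRNN class
    probabilities and softmax-normalised SVM decision values.

    crnn_probs: (n_samples, n_classes) class probabilities from the CRNN
    svm_scores: (n_samples, n_classes) decision values from the linear SVM
    crnn_weight: assumed mixing weight in [0, 1]; not taken from the paper
    """
    crnn_probs = np.asarray(crnn_probs, dtype=float)
    svm_probs = softmax(np.asarray(svm_scores, dtype=float))
    fused = crnn_weight * crnn_probs + (1.0 - crnn_weight) * svm_probs
    return fused.argmax(axis=1), fused


# Toy inputs for two audio clips and three classes (made-up numbers).
crnn_probs = np.array([[0.60, 0.30, 0.10],
                       [0.40, 0.35, 0.25]])
svm_scores = np.array([[1.2, -0.3, -0.9],
                       [-0.5, 1.5, -1.0]])

labels, fused = fuse_decisions(crnn_probs, svm_scores)
print(labels)
```

In the second toy clip the CRNN slightly prefers class 0 while the SVM strongly prefers class 1, and the fused prediction follows the SVM. This is the kind of complementarity between the two models that a decision-level fusion is meant to exploit.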

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
