IEEE Access (Jan 2023)

A Survey of Audio Classification Using Deep Learning

  • Khalid Zaman,
  • Melike Sah,
  • Cem Direkoglu,
  • Masashi Unoki

DOI
https://doi.org/10.1109/ACCESS.2023.3318015
Journal volume & issue
Vol. 11
pp. 106620 – 106649

Abstract


Deep learning can be used for audio signal classification in a variety of ways. It can be used to detect and classify various types of audio signals such as speech, music, and environmental sounds. Deep learning models are able to learn complex patterns in audio signals and can be trained on large datasets to achieve high accuracy. To employ deep learning for audio signal classification, the audio signal must first be represented in a suitable form. This can be done using signal representation techniques such as spectrograms, Mel-frequency cepstral coefficients, linear predictive coding, and wavelet decomposition. Once the audio signal is represented in a suitable form, it can be fed into a deep learning model. Various deep learning models can be utilized for audio classification. We provide an extensive survey of current deep learning models used for a variety of audio classification tasks. In particular, we focus on works published under five different deep neural network architectures, namely Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Autoencoders, Transformers, and Hybrid Models (hybrid deep learning models and hybrid deep learning models with traditional classifiers). CNNs can be used to classify audio signals into different categories such as speech, music, and environmental sounds. They can also be used for speech recognition, speaker identification, and emotion recognition. RNNs are widely used for audio classification and audio segmentation. RNN models can capture temporal patterns in audio signals and can be used to classify audio segments into different categories. Another approach is to use autoencoders to learn the features of audio signals and then classify the signals into different categories. Transformers are also well-suited for audio classification. In particular, temporal and frequency features can be extracted to identify the characteristics of the audio signals.
Finally, hybrid models for audio classification either combine various deep learning architectures (e.g., CNN-RNN) or combine deep learning models with traditional machine learning techniques (e.g., CNN-Support Vector Machine). These hybrid models take advantage of the strengths of different architectures while avoiding their weaknesses. The existing literature in each category of deep learning is summarized and compared in detail.
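
The representation step described in the abstract (turning a raw waveform into a form a deep learning model can consume) can be illustrated with a minimal sketch. The code below is not taken from the survey: it assumes NumPy only, and the function name `log_spectrogram`, the frame length of 256 samples, the hop of 128 samples, the Hann window, and the synthetic 440 Hz test tone are all illustrative choices. It computes a log-magnitude spectrogram, one of the representations the abstract lists.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a log-magnitude spectrogram of shape (num_frames, frame_len // 2 + 1).

    Hypothetical helper for illustration; frame_len and hop are assumed defaults.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Hann window reduces spectral leakage at the frame boundaries.
    window = np.hanning(frame_len)

    # Slice the signal into overlapping frames and apply the window to each.
    num_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )

    # Magnitude of the one-sided FFT of each frame, then log compression.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10)


# Synthetic input (assumption): a 1-second 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(tone)

# The frequency bin with the highest average energy should sit near 440 Hz.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256
```

The resulting 2-D array (time frames by frequency bins) is the kind of input that a CNN could treat as an image or that an RNN or Transformer could consume frame by frame. Frequency resolution here is sample_rate / frame_len, so the estimated peak is only accurate to within one bin width.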

Keywords