IEEE Access (Jan 2022)

Robust Speech Emotion Recognition Using CNN+LSTM Based on Stochastic Fractal Search Optimization Algorithm

  • Abdelaziz A. Abdelhamid,
  • El-Sayed M. El-Kenawy,
  • Bandar Alotaibi,
  • Ghada M. Amer,
  • Mahmoud Y. Abdelkader,
  • Abdelhameed Ibrahim,
  • Marwa Metwally Eid

DOI
https://doi.org/10.1109/ACCESS.2022.3172954
Journal volume & issue
Vol. 10
pp. 49265 – 49284

Abstract

One of the main challenges facing current approaches to speech emotion recognition is the lack of a dataset large enough to train the available deep learning models properly. Therefore, this paper proposes a new data augmentation algorithm that enriches the speech emotion dataset with more samples through the careful addition of noise fractions. In addition, the hyperparameters of the available deep learning models are either handcrafted or adjusted during the training process, an approach that does not guarantee finding the best settings for these parameters. Therefore, we propose an optimized deep learning model in which the hyperparameters are tuned to find their best settings and thus achieve better recognition results. This deep learning model consists of a convolutional neural network (CNN) composed of four local feature-learning blocks and a long short-term memory (LSTM) layer for learning local and long-term correlations in the log Mel-spectrogram of the input speech samples. To improve the performance of this deep network, the learning rate and the label smoothing regularization factor are optimized using the recently emerged stochastic fractal search (SFS)-guided whale optimization algorithm (WOA). The strength of this algorithm is its ability to balance exploration and exploitation of the search agents' positions so as to reach the global optimal solution. To demonstrate the effectiveness of the proposed approach, four speech emotion datasets, namely IEMOCAP, Emo-DB, RAVDESS, and SAVEE, are used in the experiments. Experimental results confirmed the superiority of the proposed approach when compared with state-of-the-art approaches. The recognition accuracies achieved on the four datasets are 98.13%, 99.76%, 99.47%, and 99.50%, respectively.
Moreover, a statistical analysis of the achieved results is provided to emphasize the stability of the proposed approach.
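
The abstract does not specify how the noise fractions are added, so the following is only a minimal sketch of one common form of noise-based augmentation: mixing zero-mean Gaussian noise into a waveform at a chosen signal-to-noise ratio (SNR). The function names, the choice of Gaussian noise, and the SNR levels are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Return a copy of `signal` with Gaussian noise mixed in at `snr_db` dB.

    Hypothetical helper: the paper's actual augmentation rule is not given
    in the abstract, so Gaussian noise is an assumption.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Average power of the clean waveform.
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def augment(signal, snr_levels_db=(20.0, 15.0, 10.0), seed=0):
    """Produce one noisy copy of `signal` per SNR level (levels are assumed)."""
    rng = np.random.default_rng(seed)
    return [add_noise_at_snr(signal, snr, rng) for snr in snr_levels_db]


if __name__ == "__main__":
    # A one-second 220 Hz tone at 16 kHz stands in for a speech sample.
    t = np.linspace(0.0, 1.0, 16000, endpoint=False)
    clean = np.sin(2.0 * np.pi * 220.0 * t)
    copies = augment(clean)
    print(len(copies), copies[0].shape)
```

Each noisy copy keeps the emotion label of the clean sample it was derived from, so a dataset of N utterances grows to N × (1 + number of SNR levels) training samples under this scheme.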

Keywords