IEEE Access (Jan 2022)

Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks

  • Sadia Sultana,
  • M. Zafar Iqbal,
  • M. Reza Selim,
  • Md. Mijanur Rashid,
  • M. Shahidur Rahman

DOI
https://doi.org/10.1109/ACCESS.2021.3136251
Journal volume & issue
Vol. 10
pp. 564 – 578

Abstract

In this study, we present a deep learning-based implementation for speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model was applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper in baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer performs better than other state-of-the-art CNN-based SER models that can work on both temporal and sequential representations of emotions. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model attains state-of-the-art perceptual efficiency, achieving weighted accuracies (WAs) of 86.9% and 82.7% on the SUBESCO and RAVDESS datasets, respectively.
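
The role of the time-distributed flatten (TDF) layer between the convolutional and recurrent parts can be illustrated with a small sketch. It assumes, as the name suggests, that the layer flattens the CNN feature map at each time step separately and leaves the time axis intact, so that the BLSTM receives a sequence of feature vectors. The function name, the [time][frequency][channel] layout, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List

# Assumed layout of a CNN feature map: [time][frequency][channel].
FeatureMap = List[List[List[float]]]


def time_distributed_flatten(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten each time step independently, keeping the time axis intact."""
    return [
        [value for freq_bin in time_step for value in freq_bin]
        for time_step in feature_map
    ]


# Toy feature map: 3 time steps, 2 frequency bins, 2 channels (made-up values).
toy = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]

sequence = time_distributed_flatten(toy)

# The time axis (3 steps) survives; each step is now one vector of 2 * 2 values.
print(len(sequence), len(sequence[0]))
```

A plain flatten would instead merge all 12 values into a single vector and discard the ordering of time steps that a recurrent layer relies on.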

Keywords