Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks

Sadia Sultana; M. Zafar Iqbal; M. Reza Selim; Md. Mijanur Rashid; M. Shahidur Rahman

doi:10.1109/ACCESS.2021.3136251

IEEE Access (Jan 2022)

Affiliations

Sadia Sultana: Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
M. Zafar Iqbal: Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
M. Reza Selim: Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
Md. Mijanur Rashid: REPL Group, Accenture, Henley-in-Arden, U.K.
M. Shahidur Rahman: Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh

DOI: https://doi.org/10.1109/ACCESS.2021.3136251
Journal volume & issue: Vol. 10, pp. 564–578

Abstract

In this study, we present a deep learning-based implementation for speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model has been applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper under baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer achieves better performance compared with other state-of-the-art CNN-based SER models which can work on both temporal and sequential representations of emotions. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model attains state-of-the-art perceptual efficiency, achieving weighted accuracies (WAs) of 86.9% and 82.7% for the SUBESCO and RAVDESS datasets, respectively.
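
The role of the time-distributed flatten (TDF) layer can be illustrated with a small sketch. The abstract does not give the layer sizes, axis order, or framework, so everything below is an assumption made for illustration: the function name `time_distributed_flatten`, the `[batch][time][freq][channels]` layout, and the toy values are all hypothetical. The idea shown is only that each time step's convolutional feature map is flattened separately, so the time axis survives and a recurrent network such as a BLSTM can read the result as a sequence.

```python
def time_distributed_flatten(batch):
    """Flatten each time step's (freq, channels) map into one vector.

    Assumed input layout:  [batch][time][freq][channels]
    Resulting layout:      [batch][time][freq * channels]

    The time axis is left untouched, so the output is still a sequence
    of per-step feature vectors (the form a recurrent layer expects).
    """
    return [
        [
            # Concatenate all frequency rows of this single time step.
            [value for freq_row in step for value in freq_row]
            for step in sample
        ]
        for sample in batch
    ]


# Toy stand-in for a CNN output (hypothetical values):
# 1 utterance, 3 time steps, 2 frequency bins, 2 channels.
cnn_output = [[
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]]

sequence = time_distributed_flatten(cnn_output)

# Still 3 time steps, each now a flat vector of 2 * 2 = 4 features.
print(len(sequence[0]), len(sequence[0][0]))
```

A plain flatten would instead merge the time axis into one long vector and discard the sequential structure. In Keras, the per-step version is commonly written as `TimeDistributed(Flatten())`, though the abstract does not say whether the authors used that framework.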

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
