DeepCNN: Spectro‐temporal feature representation for speech emotion recognition

Nasir Saleem; Jiechao Gao; Rizwana Irfan; Ahmad Almadhor; Hafiz Tayyab Rauf; Yudong Zhang; Seifedine Kadry

doi:10.1049/cit2.12233

CAAI Transactions on Intelligence Technology (Jun 2023)

Affiliations

Nasir Saleem: Department of Electrical Engineering Faculty of Engineering and Technology Gomal University D.I. Khan Pakistan
Jiechao Gao: Department of Computer Science University of Virginia Charlottesville Virginia USA
Rizwana Irfan: Department of Information Technology College of Computing and Information Technology at Khulais University of Jeddah Jeddah Saudi Arabia
Ahmad Almadhor: Department of Computer Engineering and Networks College of Computer and Information Sciences Jouf University Skaka Aljouf Saudi Arabia
Hafiz Tayyab Rauf: Independent Researcher UK
Yudong Zhang: School of Computing and Mathematical Sciences University of Leicester Leicester UK
Seifedine Kadry: Department of Applied Data Science Noroff University College Kristiansand Norway

DOI: https://doi.org/10.1049/cit2.12233
Journal volume & issue: Vol. 8, no. 2
pp. 401 – 417

Abstract

Speech emotion recognition (SER) is an important research problem in human‐computer interaction systems. Feature representation and extraction remain significant challenges in SER systems. Despite their promising results, recent studies generally do not leverage progressive fusion techniques for effective feature representation and for increasing receptive fields. To mitigate this problem, this article proposes DeepCNN, which fuses spectral and temporal features of emotional speech by running convolutional neural networks (CNNs) in parallel with a convolution layer‐based transformer. Two parallel CNNs extract spectral feature representations (2D‐CNN) and temporal feature representations (1D‐CNN). A 2D‐convolution layer‐based transformer module extracts spectro‐temporal features, which are concatenated with the features from the parallel CNNs. The learnt low‐level concatenated features are then passed to a deep framework of convolutional blocks, which retrieves a high‐level feature representation; the emotional states are subsequently categorised using an attention gated recurrent unit and a classification layer. This fusion technique yields a deeper hierarchical feature representation at a lower computational cost while simultaneously expanding the filter depth and reducing the feature map. The Berlin Database of Emotional Speech (EMO‐BD) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets are used in experiments to recognise distinct speech emotions. With efficient spectral and temporal feature representation, the proposed SER model achieves 94.2% accuracy across the different emotions on EMO‐BD and 81.1% accuracy on IEMOCAP. DeepCNN outperforms the baseline SER systems in emotion recognition accuracy on both datasets.
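
The abstract's remark about expanding the filter depth while reducing the feature map and widening the receptive field can be illustrated with standard convolution arithmetic. The sketch below is not the authors' architecture. The ConvBlock type, the trace helper, and all kernel, stride, padding, and channel values are hypothetical choices made only to show the bookkeeping along a single axis.

```python
from dataclasses import dataclass


@dataclass
class ConvBlock:
    # Hypothetical hyperparameters; none of these values come from the paper.
    kernel: int
    stride: int
    padding: int
    out_channels: int


def trace(blocks, size):
    """Return (feature-map size, filter depth, receptive field) after each block."""
    receptive_field, jump = 1, 1
    rows = []
    for block in blocks:
        # Standard output-size formula for a convolution along one axis.
        size = (size + 2 * block.padding - block.kernel) // block.stride + 1
        # The receptive field grows by (kernel - 1) input steps, scaled by the
        # cumulative stride ("jump") of all preceding layers.
        receptive_field += (block.kernel - 1) * jump
        jump *= block.stride
        rows.append((size, block.out_channels, receptive_field))
    return rows


# Three assumed stride-2 blocks that double the filter depth at each stage.
blocks = [
    ConvBlock(kernel=3, stride=2, padding=1, out_channels=16),
    ConvBlock(kernel=3, stride=2, padding=1, out_channels=32),
    ConvBlock(kernel=3, stride=2, padding=1, out_channels=64),
]
rows = trace(blocks, size=128)
for size, depth, rf in rows:
    print(f"map size={size:4d}  depth={depth:3d}  receptive field={rf:3d}")
```

Under these assumed settings, each block halves the feature map and doubles the filter depth while the receptive field grows from 3 to 7 to 15 input steps, which is the general trade-off the abstract describes.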

Published in CAAI Transactions on Intelligence Technology

ISSN: 2468-2322 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/24682322
