A Parallel-Model Speech Emotion Recognition Network Based on Feature Clustering

Li-Min Zhang; Giap Weng Ng; Yu-Beng Leau; Hao Yan

doi:10.1109/ACCESS.2023.3294274

IEEE Access (Jan 2023)

Affiliations

Li-Min Zhang: Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an, China
Giap Weng Ng: Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Yu-Beng Leau: Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Hao Yan: Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an, China

DOI: https://doi.org/10.1109/ACCESS.2023.3294274
Journal volume & issue: Vol. 11, pp. 71224–71234

Abstract

Speech Emotion Recognition (SER) is a common aspect of human-computer interaction and has significant applications in fields such as healthcare, education, and elder care. Although researchers have made progress in speech emotion feature extraction and model identification, they have struggled to create an SER system with satisfactory recognition accuracy. To address this issue, we proposed a novel algorithm called F-Emotion to select speech emotion features and established a parallel deep learning model to recognize different types of emotions. We first extracted the emotion features from speech and calculated the F-Emotion value for each feature. These values were then used to determine the combination of speech emotion features that was optimal for speech emotion recognition. Next, a parallel deep learning model was established with the speech emotion feature combination as input to train and test for each type of emotion. Finally, decision fusion was applied to the parallel output results to obtain an overall recognition result. These analyses were conducted on two datasets, RAVDESS and EMO-DB, with the accuracy of speech emotion recognition reaching 82.3% and 88.8%, respectively. The results demonstrate that the F-Emotion algorithm can effectively analyze the correspondence between speech emotion features and emotion types. The MFCC feature best describes emotions of neutrality, happiness, fear, and surprise, and Mel best describes emotions of anger and sadness. The parallel deep learning model mechanism can improve the accuracy of speech emotion recognition.
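
The decision-fusion step described above can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so this example assumes the simplest one: each parallel model emits a confidence score for its own emotion, and the emotion with the highest score wins. The function name `fuse_decisions`, the six-emotion label list, and the example scores are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical label set: the six emotions named in the abstract.
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "surprise"]


def fuse_decisions(scores):
    """Fuse per-emotion confidences into one label per utterance.

    scores: array-like of shape (n_utterances, n_emotions). Column j holds
    the confidence produced by the j-th parallel model for EMOTIONS[j].
    Assumption: a max-confidence rule; the paper's actual rule may differ.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 2 or scores.shape[1] != len(EMOTIONS):
        raise ValueError("expected shape (n_utterances, %d)" % len(EMOTIONS))
    # Pick, for each utterance, the index of the most confident model.
    winners = scores.argmax(axis=1)
    return [EMOTIONS[i] for i in winners]


# Made-up confidences for two utterances, for illustration only.
example_scores = np.array([
    [0.10, 0.80, 0.05, 0.20, 0.10, 0.30],
    [0.20, 0.10, 0.15, 0.90, 0.25, 0.05],
])
print(fuse_decisions(example_scores))  # ['happy', 'angry']
```

Because every parallel model scores only its own emotion, the scores in a row need not sum to one; a max rule works without normalisation, although a calibrated or weighted vote would be a reasonable alternative.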

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639