Robust Multi-Scenario Speech-Based Emotion Recognition System

Fangfang Zhu-Zhou; Roberto Gil-Pita; Joaquín García-Gómez; Manuel Rosa-Zurera

doi:10.3390/s22062343

Sensors (Mar 2022)

Affiliations

Fangfang Zhu-Zhou: Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
Roberto Gil-Pita: Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
Joaquín García-Gómez: Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
Manuel Rosa-Zurera: Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain

DOI: https://doi.org/10.3390/s22062343
Journal volume & issue: Vol. 22, no. 6, p. 2343

Abstract

Every human being experiences emotions daily, e.g., joy, sadness, fear, and anger. These may be revealed through speech, since the words we speak are often colored by our emotional state. Several acoustic emotional databases are freely available for the Emotional Speech Recognition (ESR) task. Unfortunately, many of them were generated under non-real-world conditions: actors played the emotions, and the recordings were made under artificial, noise-free circumstances. Another weakness in the design of emotion recognition systems is the scarcity of patterns in the available databases, which causes generalization problems and leads to overfitting. This paper examines how different elements of the recording environment affect system performance, using a simple logistic regression algorithm. Specifically, we conducted experiments simulating different scenarios, with different levels of Gaussian white noise, real-world noise, and reverberation. The results show a performance deterioration in all scenarios, with the error probability increasing from 25.57% to 79.13% in the worst case. Additionally, a virtual enlargement method and a robust multi-scenario speech-based emotion recognition system are proposed. Our system's average error probability of 34.57% is comparable to the best-case scenario's 31.55%. The findings support the prediction that simulated emotional speech databases are not sufficiently close to real scenarios.
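As an illustration of the Gaussian white noise scenario mentioned in the abstract, the following is a minimal sketch of how a clean recording might be degraded to a chosen signal-to-noise ratio. It is not the authors' code: the function name `add_white_noise`, the `snr_db` parameter, the 16 kHz sampling rate, and the synthetic sine tone standing in for speech are all assumptions made for this example.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Return `signal` plus Gaussian white noise at the requested SNR in dB.

    Hypothetical helper for illustration; the name and signature are not
    taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)

    # Average power of the clean signal.
    signal_power = np.mean(signal ** 2)

    # Noise power required to reach the target SNR: SNR = P_signal / P_noise.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))

    # Draw white Gaussian noise and rescale it to exactly the target power.
    noise = rng.standard_normal(signal.shape)
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))

    return signal + noise


# Synthetic stand-in for a speech recording: one second of a 220 Hz tone
# sampled at 16 kHz (both values are assumptions for this example).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2.0 * np.pi * 220.0 * t)

noisy = add_white_noise(clean, snr_db=10.0, rng=np.random.default_rng(0))

# Verify the SNR actually obtained from the added noise.
measured_snr_db = 10.0 * np.log10(
    np.mean(clean ** 2) / np.mean((noisy - clean) ** 2)
)
print(f"measured SNR: {measured_snr_db:.2f} dB")
```

Sweeping `snr_db` over a range of values would produce progressively noisier copies of the same utterance, which is one way to simulate the different noise levels the abstract refers to.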

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
