IEEE Access (Jan 2022)
Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition
Abstract
Self-supervised learning has recently been adopted widely in speech processing, replacing conventional acoustic feature extraction as a means of extracting meaningful information from speech. One of the more challenging applications of speech processing is extracting affective information from speech, commonly called speech emotion recognition. To date, it has remained unclear how these self-supervised speech representations compare to classical acoustic features. This paper evaluates nineteen self-supervised speech representations and one classical acoustic feature on five distinct speech emotion recognition datasets using the same classifier. We calculate effect sizes among the twenty speech representations to quantify the magnitude of the relative differences from the highest- to the lowest-performing representation. The top three are WavLM Large, UniSpeech-SAT Large, and HuBERT Large, with negligible effect sizes among them. Significance testing supports the differences among the self-supervised speech representations. The best prediction for each dataset is presented as a confusion matrix to provide insight into how the best-performing speech representations handle each emotion category, contrasting balanced vs. unbalanced training data, English vs. Japanese corpora, and five vs. six emotion categories. While demonstrating the competitiveness of self-supervised learning for speech emotion recognition, this exploration also reveals its limitations for models pre-trained on small data and trained on unbalanced datasets.
Keywords