Speech Emotion Recognition Using Transfer Learning: Integration of Advanced Speaker Embeddings and Image Recognition Models

Maros Jakubec; Eva Lieskovska; Roman Jarina; Michal Spisiak; Peter Kasak

doi:10.3390/app14219981

Applied Sciences (Oct 2024)

Affiliations

Maros Jakubec: University Science Park UNIZA, University of Žilina, Univerzitna 8215/1, 010 26 Žilina, Slovakia
Eva Lieskovska: University Science Park UNIZA, University of Žilina, Univerzitna 8215/1, 010 26 Žilina, Slovakia
Roman Jarina: Faculty of Electrical Engineering and Information Technology, University of Žilina, Univerzitna 8215/1, 010 26 Žilina, Slovakia
Michal Spisiak: Faculty of Electrical Engineering and Information Technology, University of Žilina, Univerzitna 8215/1, 010 26 Žilina, Slovakia
Peter Kasak: Faculty of Electrical Engineering and Information Technology, University of Žilina, Univerzitna 8215/1, 010 26 Žilina, Slovakia

Journal volume & issue: Vol. 14, no. 21, p. 9981

Abstract

Automatic Speech Emotion Recognition (SER) plays a vital role in making human–computer interactions more natural and effective. A significant challenge in SER development is the limited availability of diverse emotional speech datasets, which hinders the application of advanced deep learning models. Transfer learning helps address this issue by reusing knowledge from pre-trained models to improve performance on a new task in a target domain, even with limited data. This study investigates transfer learning from various pre-trained networks, including speaker embedding models (d-vector, x-vector, and r-vector) and image classification models (AlexNet, GoogLeNet, SqueezeNet, ResNet-18, and ResNet-50). We also propose enhanced versions of the x-vector and r-vector models that incorporate Multi-Head Attention Pooling and Angular Margin Softmax, alongside other architectural improvements. Additionally, reverberation from Room Impulse Response datasets was added to the speech utterances to diversify and augment the available data. Notably, the enhanced r-vector model achieved 74.05% Unweighted Accuracy (UA) and 73.68% Weighted Accuracy (WA) on the IEMOCAP dataset, and 80.25% UA and 79.81% WA on the CREMA-D dataset, outperforming existing state-of-the-art methods. This study shows that cross-domain transfer learning is beneficial for low-resource emotion recognition, and that enhanced models developed in other domains (for non-emotional tasks) can further improve the accuracy of SER.
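
The abstract does not spell out how reverberation was applied or how UA and WA were computed, so the following is only a minimal sketch under common conventions, not the authors' implementation. It assumes that reverberation is added by convolving a dry utterance with a room impulse response, that WA is the overall fraction of correctly classified utterances, and that UA is the mean of the per-class recalls. The function names `add_reverb`, `weighted_accuracy`, and `unweighted_accuracy` are hypothetical.

```python
import numpy as np


def add_reverb(speech, rir):
    """Hypothetical helper: reverberate a dry utterance with a room impulse response.

    Assumption: plain linear convolution, truncated to the original length
    and rescaled so the peak level matches the dry signal.
    """
    speech = np.asarray(speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    wet = np.convolve(speech, rir)[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet


def weighted_accuracy(y_true, y_pred):
    """WA (assumed convention): fraction of all utterances classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def unweighted_accuracy(y_true, y_pred):
    """UA (assumed convention): mean of per-class recalls, so each class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Under these definitions, a classifier that always predicts the majority class can score a high WA while its UA stays low, which is one reason both figures are usually reported together on imbalanced emotion datasets.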

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
