A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition

Nikolaos Vryzas; Lazaros Vrysis; Rigas Kotsakis; Charalampos Dimoulas

Machine Learning with Applications (Dec 2021)

Affiliations

Nikolaos Vryzas: Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece; Corresponding author.
Lazaros Vrysis: Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece
Rigas Kotsakis: Multidisciplinary Media and Mediated Communication (M3C) Research Group, International Hellenic University, Greece
Charalampos Dimoulas: Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece

Journal volume & issue: Vol. 6, p. 100132

Abstract

Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, however, it is not feasible to obtain large datasets from a specific speaker for deep learning model training. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks (CNNs). A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. It provides an online platform where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can also be used to create personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.

Published in Machine Learning with Applications

ISSN: 2666-8270 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/machine-learning-with-applications
