End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

Denis Dresvyanskiy; Elena Ryumina; Heysem Kaya; Maxim Markitantov; Alexey Karpov; Wolfgang Minker

doi:10.3390/mti6020011

Multimodal Technologies and Interaction (Jan 2022)

Affiliations

Denis Dresvyanskiy: Dialogue Group, Institute of Communications Engineering, Ulm University, 89081 Ulm, Germany
Elena Ryumina: St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia
Heysem Kaya: Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, The Netherlands
Maxim Markitantov: St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia
Alexey Karpov: St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia
Wolfgang Minker: Dialogue Group, Institute of Communications Engineering, Ulm University, 89081 Ulm, Germany

DOI: https://doi.org/10.3390/mti6020011
Journal volume & issue: Vol. 6, no. 2, p. 11

Abstract

As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention over the last two decades. While multimodal systems achieve high performance on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely “in-the-wild”, data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, which improves on the first-runner-up performance.
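
The weighted score fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the weights are chosen or whether they are set per model or per class, so the sketch assumes one non-negative scalar weight per unimodal model. The function names `weighted_score_fusion` and `predict_class`, the model names, and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence


def weighted_score_fusion(
    scores: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Fuse per-model class scores into a single score vector.

    scores:  model name -> class probabilities for one sample
    weights: model name -> non-negative scalar weight (assumption:
             one weight per model, normalized here to sum to 1)
    """
    names = list(scores)
    if not names:
        raise ValueError("at least one model is required")
    n_classes = len(scores[names[0]])
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n_classes
    for name in names:
        if len(scores[name]) != n_classes:
            raise ValueError(f"model {name!r} has a different number of classes")
        w = weights[name] / total  # normalized weight of this model
        for k, p in enumerate(scores[name]):
            fused[k] += w * p
    return fused


def predict_class(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda k: fused[k])


# Hypothetical scores for one frame from an audio and a video model.
example_scores = {"audio": [0.6, 0.3, 0.1], "video": [0.2, 0.7, 0.1]}
example_weights = {"audio": 1.0, "video": 3.0}
fused_scores = weighted_score_fusion(example_scores, example_weights)
predicted = predict_class(fused_scores)
```

In this sketch the video model is given three times the weight of the audio model, so its preference for the second class dominates the fused decision. In practice such weights would typically be tuned on a validation set.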

Published in Multimodal Technologies and Interaction

ISSN: 2414-4088 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology; Science
Website: http://www.mdpi.com/journal/mti
