IEEE Access (Jan 2023)

Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning

  • Hoai-Duy Le,
  • Guee-Sang Lee,
  • Soo-Hyung Kim,
  • Seungwon Kim,
  • Hyung-Jeong Yang

DOI
https://doi.org/10.1109/ACCESS.2023.3244390
Journal volume & issue
Vol. 11
pp. 14742–14751

Abstract


Emotion recognition has long been an active research area. Recently, multimodal emotion recognition from video data has grown in importance with the explosion of video content driven by short-video social media platforms. Effectively incorporating information from the multiple modalities in video data to learn robust multimodal representations, and thereby improve recognition performance, remains the primary challenge for researchers. In this context, transformer architectures have been widely adopted and have significantly advanced multimodal deep learning and representation learning. Inspired by this, we propose a transformer-based fusion and representation learning method that fuses and enriches multimodal features from raw videos for multi-label video emotion recognition. Specifically, our method takes raw video frames, audio signals, and text subtitles as inputs and passes information from these modalities through a unified transformer architecture to learn a joint multimodal representation. Moreover, we use a label-level representation approach to handle the multi-label classification task and enhance model performance. We conduct experiments on two benchmark datasets, Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), to evaluate the proposed method. The experimental results demonstrate that the proposed method outperforms strong baselines and existing approaches for multi-label video emotion recognition.
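
To make the described pipeline concrete, the sketch below illustrates the general pattern the abstract outlines: per-modality features projected to a shared dimension, fused by a single transformer encoder, and attended by one learnable query per emotion label to produce independent multi-label outputs. This is a minimal illustration, not the authors' implementation; all module names, dimensions, and layer counts are assumptions.

```python
# Hypothetical sketch of transformer-based multimodal fusion with
# label-level representations for multi-label emotion recognition.
# Names, dimensions, and layer counts are assumptions for illustration.
import torch
import torch.nn as nn


class MultimodalLabelLevelClassifier(nn.Module):
    def __init__(self, dim_video, dim_audio, dim_text, d_model=256,
                 num_labels=6, num_layers=4, num_heads=8):
        super().__init__()
        # Project each modality's token/frame/segment features to a shared size.
        self.proj_video = nn.Linear(dim_video, d_model)
        self.proj_audio = nn.Linear(dim_audio, d_model)
        self.proj_text = nn.Linear(dim_text, d_model)
        # Unified transformer encoder over the concatenated multimodal sequence.
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)
        # One learnable query per emotion label (label-level representation).
        self.label_queries = nn.Parameter(torch.randn(num_labels, d_model))
        self.label_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        # Independent binary logit per label for multi-label classification.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, video, audio, text):
        # video: (B, Tv, dim_video), audio: (B, Ta, dim_audio), text: (B, Tt, dim_text)
        tokens = torch.cat([self.proj_video(video),
                            self.proj_audio(audio),
                            self.proj_text(text)], dim=1)
        fused = self.fusion(tokens)                              # (B, Tv+Ta+Tt, d_model)
        queries = self.label_queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        label_repr, _ = self.label_attn(queries, fused, fused)   # (B, num_labels, d_model)
        return self.classifier(label_repr).squeeze(-1)           # (B, num_labels) logits


# Example usage: logits pass through a sigmoid and per-label thresholds,
# with BCEWithLogitsLoss as the typical multi-label training objective.
model = MultimodalLabelLevelClassifier(dim_video=512, dim_audio=128, dim_text=768)
logits = model(torch.randn(2, 32, 512), torch.randn(2, 100, 128), torch.randn(2, 20, 768))
probs = torch.sigmoid(logits)  # (2, 6) per-emotion probabilities
```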

Keywords