M2ER: Multimodal Emotion Recognition Based on Multi-Party Dialogue Scenarios

Bo Zhang; Xiya Yang; Ge Wang; Ying Wang; Rui Sun

doi:10.3390/app132011340

Applied Sciences (Oct 2023)

Affiliations

Bo Zhang: School of Information and Communication Engineering, Communication University of China, Dingfuzhuang, Chaoyang District, Beijing 100024, China
Xiya Yang: School of Information and Communication Engineering, Communication University of China, Dingfuzhuang, Chaoyang District, Beijing 100024, China
Ge Wang: School of Information and Communication Engineering, Communication University of China, Dingfuzhuang, Chaoyang District, Beijing 100024, China
Ying Wang: School of Information and Communication Engineering, Communication University of China, Dingfuzhuang, Chaoyang District, Beijing 100024, China
Rui Sun: School of Computing, Newcastle University, Newcastle upon Tyne NE1 7RU, UK

DOI: https://doi.org/10.3390/app132011340
Journal volume & issue: Vol. 13, no. 20, p. 11340

Abstract

Multimodal emotion recognition has attracted considerable recent attention, but recognizing emotions in multi-party dialogue scenarios remains difficult. Most studies use only the text and audio modalities and ignore the video modality. To address this, we propose M2ER, a multimodal emotion recognition scheme for multi-party dialogue scenarios. Because multiple faces can appear in the same video frame, M2ER uses multi-face localization for speaker recognition to eliminate interference from non-speakers. An attention mechanism fuses the modalities for classification. We conducted extensive unimodal and multimodal fusion experiments on the multi-party dialogue dataset MELD. The results show that M2ER outperforms the baseline model in both the text and audio modalities. In the video modality, the proposed method with speaker recognition improves emotion recognition performance by 6.58% compared with the method without speaker recognition. In addition, the attention-based multimodal fusion outperforms the baseline fusion model.
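The abstract does not specify how the attention mechanism is built, so the following is only a minimal illustrative sketch of one common form of attention-based fusion, not the authors' implementation. It assumes each modality has already been reduced to a fixed-length feature vector and that a query vector of the same length is available (in a trained model it would be learned). The names `attention_fuse`, `features`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse per-modality feature vectors with scaled dot-product attention.

    features: dict mapping a modality name to a feature vector (list of floats).
    query:    vector of the same length; assumed given here, learned in practice.
    Returns the fused vector and the attention weight assigned to each modality.
    """
    names = list(features)
    dim = len(query)
    # Score each modality by its scaled dot product with the query.
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused representation is the attention-weighted sum of the modalities.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    # Toy two-dimensional features, chosen only for illustration.
    toy = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [1.0, 1.0]}
    fused, weights = attention_fuse(toy, query=[1.0, 0.0])
    print(weights)
    print(fused)
```

In a full system the fused vector would be passed to a classifier over the emotion labels; that step is omitted here because the abstract gives no details about it.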

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci