IEEE Access (Jan 2019)

A Hybrid Latent Space Data Fusion Method for Multimodal Emotion Recognition

  • Shahla Nemati,
  • Reza Rohani,
  • Mohammad Ehsan Basiri,
  • Moloud Abdar,
  • Neil Y. Yen,
  • Vladimir Makarenkov

DOI
https://doi.org/10.1109/ACCESS.2019.2955637
Journal volume & issue
Vol. 7
pp. 172948–172964

Abstract

Multimodal emotion recognition is an emerging interdisciplinary field of research in the area of affective computing and sentiment analysis. It aims to exploit the information carried by signals of different natures to make emotion recognition systems more accurate. This is achieved by employing a powerful multimodal fusion method. In this study, a hybrid multimodal data fusion method is proposed in which the audio and visual modalities are fused using a latent space linear map; their features, projected into the cross-modal space, are then fused with the textual modality using a Dempster-Shafer (DS) theory-based evidential fusion method. The evaluation of the proposed method on the videos of the DEAP dataset shows its superiority over both decision-level and non-latent-space fusion methods. Furthermore, the results reveal that employing Marginal Fisher Analysis (MFA) for feature-level audio-visual fusion yields a greater improvement than cross-modal factor analysis (CFA) or canonical correlation analysis (CCA). Finally, the results show that exploiting users' textual comments together with the audiovisual content of movies improves the system's performance.
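
The abstract outlines a two-stage pipeline: a latent-space linear map that fuses audio and visual features, followed by DS-theoretic combination with the text modality. The sketch below is a minimal illustration of those two stages, not the authors' implementation: it assumes CCA for the latent-space map (the paper also evaluates CFA and MFA, with MFA performing best) and a generic implementation of Dempster's rule of combination. All feature dimensions, variable names, and mass values are hypothetical.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# --- Stage 1: latent-space audio-visual feature fusion (CCA variant) ---
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 40))  # 200 clips x 40 audio features (assumed)
X_video = rng.normal(size=(200, 60))  # 200 clips x 60 visual features (assumed)

cca = CCA(n_components=10)
Z_audio, Z_video = cca.fit_transform(X_audio, X_video)  # projections into the cross-modal space
X_av = np.hstack([Z_audio, Z_video])  # fused audio-visual representation fed to a classifier

# --- Stage 2: Dempster-Shafer evidential fusion with the text modality ---
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions given as
    dicts mapping frozenset focal elements (sets of emotion classes)
    to masses. Conflicting mass is discarded and the rest renormalized."""
    combined = {}
    conflict = 0.0
    for A, mA in m1.items():
        for B, mB in m2.items():
            inter = A & B
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mA * mB
            else:
                conflict += mA * mB
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

# Illustrative mass functions from the audio-visual and text classifiers
# over two classes, 'H' (high valence) and 'L' (low valence), with
# residual mass on the full frame {H, L} expressing ignorance.
m_av   = {frozenset("H"): 0.6, frozenset("L"): 0.2, frozenset("HL"): 0.2}
m_text = {frozenset("H"): 0.5, frozenset("L"): 0.3, frozenset("HL"): 0.2}
print(dempster_combine(m_av, m_text))
```

For the illustrative masses above, Dempster's rule reinforces the two sources' agreement on the high-valence class (its combined mass rises to about 0.72) while the 0.28 of conflicting mass is discarded by renormalization.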

Keywords