IEEE Access (Jan 2024)
Multimodal Emotion Recognition Using Feature Fusion: An LLM-Based Approach
Abstract
Multimodal emotion recognition is a developing field that analyzes emotions through various channels, mainly audio, video, and text. However, existing state-of-the-art systems consider at most two or three modalities, rely on traditional techniques, fail to model emotional interplay, offer little scope for adding further modalities, and do not predict emotions accurately. This research proposes a novel approach that uses rule-based systems to convert non-verbal cues into text, inspired by a limited prior attempt that lacked proper benchmarking. It achieves efficient multimodal emotion recognition by fine-tuning DistilRoBERTa, a pre-trained language model, on a combined textual representation of audio features (such as loudness, spectral flux, MFCCs, pitch stability, and emphasis) and visual features (facial action units) extracted from videos. The approach is evaluated on the RAVDESS and BAUM-1 datasets, achieving high accuracy on both (93.18% on RAVDESS and 93.69% on BAUM-1) and performing on par with, if not slightly better than, state-of-the-art systems. Furthermore, the research highlights the potential to incorporate additional modalities by transforming them into text with rule-based systems and using that text to further fine-tune pre-trained language models, giving rise to a more comprehensive approach to emotion recognition.
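To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the idea summarized above: numeric audio and facial action-unit cues are verbalized by simple rules into a textual description, which is then scored by a DistilRoBERTa classification head. The thresholds, the wording of the template, the choice of action units, and the label set are illustrative assumptions; in the paper the classifier would be fine-tuned on RAVDESS and BAUM-1 rather than used untrained.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative label set (RAVDESS-style emotion categories)
LABELS = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]

def verbalize(features: dict) -> str:
    """Rule-based conversion of numeric cues into a textual description.
    Thresholds and phrasing are assumptions made for this sketch."""
    parts = []
    parts.append("a loud voice" if features["loudness"] > 0.7 else "a soft voice")
    parts.append("unstable pitch" if features["pitch_stability"] < 0.5 else "steady pitch")
    if features.get("AU12", 0) > 1.0:   # lip-corner puller, often associated with smiling
        parts.append("a smiling face")
    if features.get("AU04", 0) > 1.0:   # brow lowerer, often associated with frowning
        parts.append("furrowed brows")
    return "The speaker has " + ", ".join(parts) + ". Transcript: " + features["transcript"]

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=len(LABELS)
)  # this classification head would be fine-tuned on the fused textual inputs

sample = {"loudness": 0.82, "pitch_stability": 0.3, "AU04": 1.4,
          "transcript": "dogs are sitting by the door"}
inputs = tokenizer(verbalize(sample), return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])

Because every modality is reduced to text before classification, adding a new channel (for example, physiological signals) would only require another set of verbalization rules, which is the extensibility argument made in the abstract.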
Keywords