IEEE Access (Jan 2024)
Modality- and Subject-Aware Emotion Recognition Using Knowledge Distillation
Abstract
Multimodal emotion recognition has the potential to impact various fields, including human-computer interaction, virtual reality, and emotional intelligence systems. This study introduces a comprehensive framework that enhances the accuracy and computational efficiency of emotion recognition by leveraging knowledge distillation and transfer learning, incorporating both unimodal and multimodal models. The framework also combines subject-specific and subject-independent models, achieving a balance between localization and generalization. Subject-independent models include EEG-based, non-EEG-based (i.e., electromyography, electrooculography, electrodermal activity, galvanic skin response, skin temperature, respiration, blood volume pulse, heart rate, and eye movements), and multimodal models trained on all training subjects, capturing a broader context. Subject-specific models, including EEG-based, non-EEG-based, and multimodal models, are trained on individual subjects to provide localized knowledge. The proposed framework then distills knowledge from these teacher models into a student model, utilizing six different distillation losses to combine both subject-independent and subject-specific insights. This approach makes the model subject-aware by using local patterns and modality-aware by incorporating unimodal data, enhancing the robustness and generalizability of emotion recognition systems to varied real-world scenarios. The framework was tested on two well-known datasets, SEED-V and DEAP, as well as an immersive three-Dimensional (3D) Virtual Reality (VR) dataset, GraffitiVR, which captures emotional and behavioral responses from individuals experiencing urban graffiti in a VR environment. This broader application provides insights into the effectiveness of emotion recognition models in both 2D and 3D settings, facilitating a wider range of assessment. Empirical results demonstrate that the proposed knowledge distillation-based model significantly elevates performance across all datasets when compared to traditional models. Specifically, the model demonstrated improvements ranging from 6.56% to 24.59% over unimodal models and from 1.56% to 4.11% over multimodal approaches across the SEED-V, DEAP, and GraffitiVR datasets. These results underscore the robustness and effectiveness of the proposed approach, suggesting that it significantly enhances emotion recognition processes across various environmental settings.
Keywords