IEEE Access (Jan 2024)
Facial Expression Recognition Using Visible, IR, and MSX Images by Early and Late Fusion of Deep Learning Models
Abstract
Facial expression recognition (FER) is one of the most effective non-intrusive methods for understanding and tracking mood and mental states. In this study, we propose early and late fusion methods to recognize five facial expressions (angry, happy, neutral, sad, and surprised) using different modality combinations from two publicly available databases: VIRI, which provides visible, infrared (IR), and multispectral dynamic imaging (MSX) images, and NVIE. A distinctive feature of this work is the use of feature concatenation and output combination to fuse ResNet-18 models trained with transfer learning (TL), yielding a fused model that is significantly more accurate than the individual models. In early fusion, we concatenated features from the modalities and classified facial expressions (FEs). In late fusion, we combined the outputs of the modalities using weighted sums, with weighting factors set according to the accuracy of the individual models. The experimental results demonstrated that the proposed model outperformed previous works, achieving an accuracy of 83.33% with single-step (1-step) training. Through further fine-tuning (3-step training), we obtained an improved accuracy of 84.44%. We also conducted experiments incorporating a third modality (MSX) available in the VIRI database, which further improved performance and confirms that an additional modality combined with existing modalities can help improve fusion models for FER. Finally, we experimented with alternative backbones (VGG-16, ShuffleNetV2, MobileNetV2, and GhostNet) in addition to ResNet-18 for the visible and MSX data; ResNet-18 outperformed the other backbones on both.
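To make the two fusion schemes concrete, the following is a minimal PyTorch sketch (our assumption; the paper does not publish code): early fusion concatenates the 512-dimensional penultimate features of per-modality ResNet-18 backbones before a shared classifier, while late fusion forms a weighted sum of per-modality outputs, with weights reflecting each individual model's accuracy. All module names, the number of modalities, and the weight values are illustrative, not the authors' exact implementation.

```python
# Minimal sketch of the early- and late-fusion schemes described above.
# Assumptions (not from the paper): torchvision ResNet-18 backbones,
# 512-d penultimate features, and illustrative fusion weights.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_CLASSES = 5  # angry, happy, neutral, sad, surprised

def make_backbone():
    """ResNet-18 feature extractor (TL would load pretrained weights here)."""
    m = resnet18(weights=None)  # e.g., weights="IMAGENET1K_V1" for transfer learning
    m.fc = nn.Identity()        # expose the 512-d penultimate features
    return m

class EarlyFusion(nn.Module):
    """Concatenate per-modality features, then classify jointly."""
    def __init__(self, num_modalities=2):
        super().__init__()
        self.backbones = nn.ModuleList(make_backbone() for _ in range(num_modalities))
        self.classifier = nn.Linear(512 * num_modalities, NUM_CLASSES)

    def forward(self, images):  # images: list of (B, 3, H, W) tensors, one per modality
        feats = [bb(x) for bb, x in zip(self.backbones, images)]
        return self.classifier(torch.cat(feats, dim=1))

def late_fusion(logits_per_model, weights):
    """Weighted sum of per-modality class probabilities; the weights stand in
    for accuracy-based factors (the values used here are illustrative)."""
    probs = [w * torch.softmax(z, dim=1) for z, w in zip(logits_per_model, weights)]
    return torch.stack(probs).sum(dim=0)

# Example: fuse visible + IR predictions with accuracy-weighted outputs.
vis_logits = torch.randn(4, NUM_CLASSES)
ir_logits = torch.randn(4, NUM_CLASSES)
fused = late_fusion([vis_logits, ir_logits], weights=[0.55, 0.45])
predictions = fused.argmax(dim=1)
```

Adding a third input (e.g., MSX) under this sketch amounts to setting num_modalities=3 in early fusion, or appending a third logits tensor and weight in late fusion.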
Keywords