Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition

Min Kyu Lee; Dae Ha Kim; Byung Cheol Song

doi:10.3390/s20185184

Sensors (Sep 2020)

Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition

Min Kyu Lee,
Dae Ha Kim,
Byung Cheol Song

Affiliations

Min Kyu Lee: Department of Electronic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea
Dae Ha Kim: Department of Electronic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea
Byung Cheol Song: Department of Electronic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea

DOI: https://doi.org/10.3390/s20185184
Journal volume & issue: Vol. 20, no. 18
p. 5184

Abstract

Read online

Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors

About the journal

Abstract

Keywords