IEEE Access (Jan 2019)
Learning Hierarchical Emotion Context for Continuous Dimensional Emotion Recognition From Video Sequences
Abstract
Dimensional emotion recognition is currently one of the most challenging tasks in the field of affective computing. In this paper, a novel three-stage method is proposed to learn hierarchical emotion context information (feature- and label-level contexts) for predicting affective dimension values from video sequences. In the first stage, a feed-forward neural network generates a high-level representation of the raw input features. In the second stage, bidirectional long short-term memory (BLSTM) layers learn context information from the feature sequences in this high-level representation and produce initial recognition results for the input. In the third stage, a BLSTM neural network learns context information from the emotion label sequences in an unsupervised way; this label-level context is used to correct the initial recognition results and produce the final predictions. We also explore the influence of different sequence lengths by sampling from the original sequences. Experiments performed on the video data of the AVEC 2015 challenge demonstrate the effectiveness of the proposed method. Our framework highlights that incorporating both feature- and label-level dependencies and context information is a promising research direction for continuous dimensional emotion prediction.
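To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of the architecture as described in the abstract. The layer widths, the single affective output dimension (e.g., arousal), and the use of one BLSTM layer per stage are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the three-stage hierarchical context model.
import torch
import torch.nn as nn

class Stage1FFN(nn.Module):
    """Stage 1: feed-forward network producing a high-level representation."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim), nn.ReLU())
    def forward(self, x):            # x: (batch, frames, in_dim)
        return self.net(x)

class Stage2BLSTM(nn.Module):
    """Stage 2: BLSTM over feature sequences -> initial predictions."""
    def __init__(self, hid_dim, lstm_dim):
        super().__init__()
        self.blstm = nn.LSTM(hid_dim, lstm_dim, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * lstm_dim, 1)  # one affective dimension
    def forward(self, h):
        y, _ = self.blstm(h)
        return self.out(y)           # (batch, frames, 1) initial results

class Stage3LabelContext(nn.Module):
    """Stage 3: BLSTM over the label sequence, correcting initial outputs."""
    def __init__(self, lstm_dim):
        super().__init__()
        self.blstm = nn.LSTM(1, lstm_dim, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * lstm_dim, 1)
    def forward(self, y_init):
        c, _ = self.blstm(y_init)
        return self.out(c)           # corrected final predictions

# Usage on dummy per-frame video features (sizes are assumptions):
x = torch.randn(4, 100, 64)          # (batch, frames, feature dim)
y = Stage3LabelContext(32)(Stage2BLSTM(128, 64)(Stage1FFN(64, 128)(x)))
print(y.shape)                       # torch.Size([4, 100, 1])
```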
Keywords