Multimodal Affect Models: An Investigation of Relative Salience of Audio and Visual Cues for Emotion Prediction

Jingyao Wu; Ting Dang; Vidhyasaharan Sethu; Eliathamby Ambikairajah

doi:10.3389/fcomp.2021.767767

Frontiers in Computer Science (Dec 2021)

Affiliations

Jingyao Wu: School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW, Australia
Ting Dang: School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW, Australia
Ting Dang: Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Vidhyasaharan Sethu: School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW, Australia
Eliathamby Ambikairajah: School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW, Australia

Journal volume & issue: Vol. 3

Abstract

People perceive emotions via multiple cues, predominantly speech and visual cues, and a number of emotion recognition systems utilize both audio and visual cues. Moreover, the static aspects of emotion (the speaker's arousal level is high or low) and the dynamic aspects of emotion (the speaker is becoming more aroused) might be perceived via different expressive cues, and these two aspects are integrated to provide a unified sense of emotion state. However, existing multimodal systems focus on only a single aspect of emotion perception, and the contributions of different modalities toward modeling static and dynamic emotion aspects are not well explored. In this paper, we investigate the relative salience of the audio and video modalities for emotion state prediction and emotion change prediction using a Multimodal Markovian affect model. Experiments conducted on the RECOLA database showed that the audio modality is better at modeling the emotion state of arousal and the video modality is better at modeling the emotion state of valence, whereas audio shows a clear advantage over video in modeling emotion changes for both arousal and valence.

Published in Frontiers in Computer Science

ISSN: 2624-9898 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/computer-science#
