IET Image Processing (Oct 2024)

Emotion recognition in user‐generated videos with long‐range correlation‐aware network

  • Yun Yi,
  • Jin Zhou,
  • Hanli Wang,
  • Pengjie Tang,
  • Min Wang

DOI: https://doi.org/10.1049/ipr2.13174
Journal volume & issue: Vol. 18, no. 12, pp. 3288–3301

Abstract

Emotion recognition in user-generated videos plays an essential role in affective computing. Because visual information directly affects human emotions, the visual modality is important for emotion recognition. Most classical approaches focus mainly on the local temporal information of videos, which restricts their capacity to encode correlations in long-range context. To address this issue, a novel network is proposed to recognize emotions in videos. Specifically, a spatio-temporal correlation-aware block is designed to model the correlations between input tokens: convolutional layers learn local correlations, while an inter-image cross-attention mechanism learns long-range spatio-temporal correlations. To generate diverse and challenging samples, a dual-augmentation fusion layer is devised, which fuses each frame with its corresponding frame in the temporal domain. To produce rich video clips, a long-range sampling layer is designed, which generates clips across a wide range of the spatial and temporal domains. Extensive experiments are conducted on two challenging video emotion datasets, VideoEmotion-8 and Ekman-6. The experimental results demonstrate that the proposed method outperforms baseline methods and achieves state-of-the-art results on both datasets. The source code of the proposed network is available at: https://github.com/JinChow/LRCANet.
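The core idea of inter-image cross-attention can be pictured with a minimal sketch: queries come from the tokens of one frame while keys and values come from another frame, so each token aggregates long-range context across frames rather than only within its own frame. The PyTorch snippet below is only an illustration under assumed shapes and names (CrossFrameAttention, embed_dim, num_heads are hypothetical); it is not the authors' LRCANet implementation.

```python
# Minimal sketch of inter-image cross-attention between the token
# sequences of two frames. Hypothetical illustration, not LRCANet.
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Tokens of frame A attend to tokens of frame B, modeling
    long-range spatio-temporal correlations across frames."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens_a, tokens_b):
        # Queries from frame A; keys/values from frame B: each token in A
        # aggregates context from every spatial position of B.
        out, _ = self.attn(query=tokens_a, key=tokens_b, value=tokens_b)
        return self.norm(tokens_a + out)  # residual connection + norm

# Toy usage: batch of 4 clips, 196 tokens (14x14 patches) per frame,
# 256-dimensional token embeddings.
a = torch.randn(4, 196, 256)
b = torch.randn(4, 196, 256)
fused = CrossFrameAttention()(a, b)
print(fused.shape)  # torch.Size([4, 196, 256])
```

In this sketch, local correlations would still be handled by convolutional layers elsewhere in the block; the cross-attention only supplies the long-range, cross-frame component described in the abstract.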
