IET Image Processing (Oct 2024)

Emotion recognition in user‐generated videos with long‐range correlation‐aware network

  • Yun Yi,
  • Jin Zhou,
  • Hanli Wang,
  • Pengjie Tang,
  • Min Wang

DOI: https://doi.org/10.1049/ipr2.13174
Journal volume & issue: Vol. 18, no. 12, pp. 3288–3301

Abstract

Emotion recognition in user-generated videos plays an essential role in affective computing. Because visual information directly affects human emotions, the visual modality is important for emotion recognition. Most classical approaches focus mainly on the local temporal information of videos, which restricts their capacity to encode correlations in long-range context. To address this issue, a novel network is proposed to recognize emotions in videos. Specifically, a spatio-temporal correlation-aware block is designed to model the correlations between input tokens: convolutional layers learn local correlations, while an inter-image cross-attention mechanism learns long-range spatio-temporal correlations. To generate diverse and challenging samples, a dual-augmentation fusion layer is devised, which fuses each frame with its corresponding frame in the temporal domain. To produce rich video clips, a long-range sampling layer is designed, which generates clips across a wide range of the spatial and temporal domains. Extensive experiments are conducted on two challenging video emotion datasets, VideoEmotion-8 and Ekman-6. The experimental results demonstrate that the proposed method outperforms baseline methods and achieves state-of-the-art results on both datasets. The source code of the proposed network is available at: https://github.com/JinChow/LRCANet.
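The core idea of inter-image cross-attention can be pictured with a minimal sketch: queries come from the tokens of one frame while keys and values come from another frame, so each token aggregates long-range context across frames rather than only within its own frame. The PyTorch snippet below is only an illustration under assumed shapes and names (CrossFrameAttention, embed_dim, num_heads are hypothetical); it is not the authors' LRCANet implementation.

```python
# Minimal sketch of inter-image cross-attention between the token
# sequences of two frames. Hypothetical illustration, not LRCANet.
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Tokens of frame A attend to tokens of frame B, modeling
    long-range spatio-temporal correlations across frames."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens_a, tokens_b):
        # Queries from frame A; keys/values from frame B: each token in A
        # aggregates context from every spatial position of B.
        out, _ = self.attn(query=tokens_a, key=tokens_b, value=tokens_b)
        return self.norm(tokens_a + out)  # residual connection + norm

# Toy usage: batch of 4 clips, 196 tokens (14x14 patches) per frame,
# 256-dimensional token embeddings.
a = torch.randn(4, 196, 256)
b = torch.randn(4, 196, 256)
fused = CrossFrameAttention()(a, b)
print(fused.shape)  # torch.Size([4, 196, 256])
```

In this sketch, local correlations would still be handled by convolutional layers elsewhere in the block; the cross-attention only supplies the long-range, cross-frame component described in the abstract.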
