CAAI Transactions on Intelligence Technology (Aug 2024)

Conditional selection with CNN augmented transformer for multimodal affective analysis

  • Jianwen Wang,
  • Shiping Wang,
  • Shunxin Xiao,
  • Renjie Lin,
  • Mianxiong Dong,
  • Wenzhong Guo

DOI
https://doi.org/10.1049/cit2.12320
Journal volume & issue
Vol. 9, no. 4
pp. 917 – 931

Abstract

Attention mechanisms have proved successful for multimodal affective analysis in recent years. Despite these advances, several significant challenges remain in fusing language with its nonverbal context information. One is to generate sparse attention coefficients for the acoustic and visual modalities, which helps locate critical emotional semantics. The other is to fuse complementary cross‐modal representations so as to construct optimal salient feature combinations across multiple modalities. A Conditional Transformer Fusion Network is proposed to address these problems. Firstly, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Secondly, sentiment words are utilised as context conditions to guide the computation of cross‐modal attention. As a result, the located nonverbal features are not only salient but also directly complementary to the sentiment words. Experimental results show that the authors’ method achieves state‐of‐the‐art performance on several multimodal affective analysis datasets.
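
The two ideas named in the abstract, convolution over nonverbal sequences and attention conditioned on sentiment words, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive way the condition enters the query, the fixed smoothing kernel, and the toy data are all assumptions made for illustration, and plain softmax is used, so the sparsity of the attention coefficients is not modelled.

```python
import math

def conv1d_same(seq, kernel):
    """Convolve each feature channel along time with zero padding (same length out)."""
    half = len(kernel) // 2
    steps, dims = len(seq), len(seq[0])
    out = []
    for t in range(steps):
        row = []
        for d in range(dims):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < steps:  # positions outside the sequence count as zero
                    acc += w * seq[s][d]
            row.append(acc)
        out.append(row)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_attention(word_vec, condition_vec, nonverbal_seq):
    """Attend over nonverbal frames with a query shifted by a sentiment condition.

    Assumption: the condition is simply added to the word vector; the paper
    may combine them differently.
    """
    query = [w + c for w, c in zip(word_vec, condition_vec)]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale
              for frame in nonverbal_seq]
    weights = softmax(scores)
    dims = len(nonverbal_seq[0])
    fused = [sum(w * frame[d] for w, frame in zip(weights, nonverbal_seq))
             for d in range(dims)]
    return weights, fused

# Toy acoustic sequence: 4 time steps, 2 features (hypothetical values).
acoustic = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.0, 0.1]]
local_features = conv1d_same(acoustic, [0.25, 0.5, 0.25])
weights, fused = conditional_attention([0.5, 0.5], [1.0, 1.0], local_features)
```

Here `weights` holds one attention coefficient per acoustic frame and `fused` is the resulting weighted nonverbal representation; in a trained model the kernel and the projections of query, key, and value would be learned rather than fixed.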

Keywords