Intelligent Systems with Applications (Sep 2024)

AdaFN-AG: Enhancing multimodal interaction with Adaptive Feature Normalization for multimodal sentiment analysis

  • Weilong Liu,
  • Hua Xu,
  • Yu Hua,
  • Yunxian Chi,
  • Kai Gao

Journal volume & issue
Vol. 23
p. 200410

Abstract


In multimodal sentiment analysis, achieving effective fusion of text, acoustic, and visual modalities for improved sentiment prediction is a central research problem. Recent studies typically employ tensor-based or attention-based mechanisms for multimodal fusion; however, the former yields unsatisfactory prediction performance, while the latter complicates the computation of fusion between non-textual modalities. This paper therefore proposes a multimodal sentiment analysis model based on Adaptive Feature Normalization and an Attention Gating mechanism (AdaFN-AG). First, for the highly synchronized non-textual modalities, we design the Adaptive Feature Normalization (AdaFN) method, which focuses on the interaction of sentiment features rather than on temporal alignment. In AdaFN, acoustic and visual features interact across modalities through normalization, inverse normalization, and mix-up operations, with weights adaptively regulating the strength of the cross-modal interaction. Meanwhile, we design an Attention Gating mechanism that enables cross-modal interaction between textual and non-textual modalities through cross-attention and captures temporal associations, while a gating module regulates the intensity of these interactions. In addition, we employ self-attention to capture the intrinsic correlations within single-modal features. We conduct experiments on three benchmark datasets for multimodal sentiment analysis; the results show that AdaFN-AG outperforms the baselines on most evaluation metrics. These experiments validate that AdaFN-AG improves performance by applying an appropriate method to each type of cross-modal interaction while conserving computational resources, and they also verify the generalization capability of the AdaFN method.
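
The abstract does not spell out how AdaFN is implemented. The sketch below is a minimal PyTorch illustration, assuming an AdaIN-style interaction in which one modality's features are normalized, re-scaled with the other modality's statistics (the "inverse normalization"), and then mixed back with the originals under a weight that controls interaction strength. The module name, tensor shapes, and the sigmoid-weight parameterization are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

class AdaFNSketch(nn.Module):
    """Illustrative sketch of an AdaIN-style cross-modal interaction:
    normalize one modality's features, re-scale them with the other
    modality's statistics, and mix the result with the original
    features under an adaptive strength weight."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        # Hypothetical mixing parameter; the paper's exact
        # parameterization is not given in the abstract.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.eps = eps

    def forward(self, src: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # src, ref: (batch, seq_len, dim) features of the two non-textual modalities
        mu_s, std_s = src.mean(dim=1, keepdim=True), src.std(dim=1, keepdim=True)
        mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True)
        normalized = (src - mu_s) / (std_s + self.eps)   # normalization
        transferred = normalized * std_r + mu_r          # inverse normalization with ref statistics
        w = torch.sigmoid(self.alpha)                    # adaptive strength in (0, 1)
        return w * transferred + (1.0 - w) * src         # mix-up of transferred and original features

In this reading, the weight w plays the adaptive-regulation role described in the abstract: it decides how strongly the reference modality's statistics reshape the source modality's features, without requiring any attention computation between the two non-textual streams.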

Keywords