Jisuanji Kexue yu Tansuo (Journal of Frontiers of Computer Science and Technology), May 2024

Temporal Multimodal Sentiment Analysis with a Composite Cross-Modal Interaction Network

  • YANG Li, ZHONG Junhong, ZHANG Yun, SONG Xinyu

DOI
https://doi.org/10.3778/j.issn.1673-9418.2311004
Journal volume & issue
Vol. 18, no. 5
pp. 1318 – 1327

Abstract


To address the insufficient modal fusion and weak interactivity caused by semantic feature differences between modalities in multimodal sentiment analysis, a temporal multimodal sentiment analysis model with a composite cross-modal interaction network (CCIN-SA) is constructed by analyzing the latent correlations between modalities. The model first uses a bidirectional gated recurrent unit (BiGRU) and a multi-head attention mechanism to extract temporal features of the text, visual, and audio modalities enriched with contextual semantic information. A cross-modal attention interaction layer is then designed to continually reinforce the target modality with low-order signals from the auxiliary modalities, so that the target modality learns information from the auxiliary modalities and captures the latent adaptability between them. The enhanced features are then fed into a composite feature fusion layer, which further captures the similarity between modalities through condition vectors, strengthens the correlation of important features, and mines deeper inter-modal interactivity. Finally, a multi-head attention mechanism concatenates and fuses the composite cross-modal enhanced features with the low-order signals, increasing the weight of important features within each modality while preserving the unique feature information of the original modalities, and the resulting multimodal fused features are used for the final sentiment classification task. The model is evaluated on the CMU-MOSI and CMU-MOSEI datasets, and the results show improvements in accuracy and F1 score over existing models, indicating that CCIN-SA can effectively exploit the correlations between modalities and make more accurate sentiment judgments.
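To make the described pipeline concrete, the sketch below shows a minimal PyTorch layout of the stages named in the abstract: per-modality BiGRU plus multi-head attention encoding, cross-modal attention that reinforces a target modality with auxiliary modalities, and a fused multi-head attention head for the sentiment prediction. It is an illustrative reconstruction only; the class names, feature dimensions, number of heads, condition-vector details, and pooling are assumptions, not the authors' released implementation.

```python
# Rough sketch of a CCIN-SA-style pipeline (hypothetical sizes and names).
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """BiGRU + multi-head self-attention over one modality's time series."""
    def __init__(self, in_dim, hid_dim, n_heads=4):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hid_dim, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq, in_dim)
        h, _ = self.bigru(x)                   # (batch, seq, 2*hid_dim)
        h, _ = self.attn(h, h, h)              # contextual temporal features
        return h

class CrossModalAttention(nn.Module):
    """Target modality queries an auxiliary modality's features."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, target, auxiliary):
        out, _ = self.attn(target, auxiliary, auxiliary)
        return out + target                    # residual keeps the low-order signal

class CCINSA(nn.Module):
    """Toy composite cross-modal interaction network for sentiment prediction."""
    def __init__(self, dims, hid_dim=64):
        super().__init__()
        self.encoders = nn.ModuleList([UnimodalEncoder(d, hid_dim) for d in dims])
        d = 2 * hid_dim
        self.cross = nn.ModuleList([CrossModalAttention(d) for _ in range(6)])
        self.fuse_attn = nn.MultiheadAttention(3 * d, 4, batch_first=True)
        self.head = nn.Linear(3 * d, 1)        # sentiment score

    def forward(self, text, visual, audio):
        t, v, a = [enc(x) for enc, x in zip(self.encoders, (text, visual, audio))]
        # Each target modality is reinforced by the two auxiliary modalities.
        t2 = self.cross[0](t, v) + self.cross[1](t, a)
        v2 = self.cross[2](v, t) + self.cross[3](v, a)
        a2 = self.cross[4](a, t) + self.cross[5](a, v)
        fused = torch.cat([t2, v2, a2], dim=-1)            # composite features
        fused, _ = self.fuse_attn(fused, fused, fused)     # reweight important features
        return self.head(fused.mean(dim=1))                # pooled prediction

# Example with CMU-MOSI-like feature sizes (hypothetical dimensions).
model = CCINSA(dims=(300, 35, 74))
score = model(torch.randn(2, 50, 300),
              torch.randn(2, 50, 35),
              torch.randn(2, 50, 74))
```

The residual connection in the cross-modal block and the final concatenation stand in for the abstract's "low-order signal" preservation; the paper's actual composite fusion layer with condition vectors is likely more elaborate.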

Keywords