Jisuanji kexue yu tansuo (Sep 2024)

Multimodal Sentiment Analysis Based on Cross-Modal Semantic Information Enhancement

  • LI Mengyun, ZHANG Jing, ZHANG Huanxiang, ZHANG Xiaolin, LIU Luyao

DOI
https://doi.org/10.3778/j.issn.1673-9418.2307045
Journal volume & issue
Vol. 18, no. 9
pp. 2476 – 2486

Abstract

With the development of social networks, people express their emotions through multiple channels, including text, vision, and speech, i.e., multiple modalities. To address the failure of previous multimodal sentiment analysis methods to effectively obtain multimodal sentiment feature representations, and their insufficient attention to the impact of redundant information during multimodal feature fusion, a multimodal sentiment analysis model based on cross-modal semantic information enhancement is proposed. Firstly, the model adopts a BiLSTM network to mine the contextual information within each modality. Secondly, the interaction between modalities is modeled through a cross-modal information interaction mechanism, yielding six kinds of interaction features: text-to-speech and text-to-vision, speech-to-text and speech-to-vision, and vision-to-text and vision-to-speech. The interaction features sharing the same target modality are then concatenated to obtain information-enhanced unimodal feature vectors, which efficiently capture the shared and complementary deep semantic features between modalities. In addition, a multi-head self-attention mechanism computes the semantic correlations between the original unimodal feature vectors and the information-enhanced unimodal feature vectors, which improves the identification of key sentiment features and reduces the negative interference of redundant information on sentiment analysis. Experimental results on the public CMU-MOSI (CMU multimodal opinion-level sentiment intensity) and CMU-MOSEI (CMU multimodal opinion sentiment and emotion intensity) datasets show that the proposed model both enhances sentiment feature representations and effectively reduces the interference of redundant information, outperforming related works in multimodal sentiment classification accuracy and generalization ability.
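To make the described pipeline concrete, the following PyTorch sketch illustrates the overall flow: BiLSTM encoders for intra-modal context, one cross-attention block per ordered modality pair (six in total), concatenation of the interaction features that share a target modality, and multi-head self-attention over the original and enhanced feature vectors. This is only a minimal sketch of the abstract's description; the module names, feature dimension `d`, mean-pooling, the use of `nn.MultiheadAttention` for the interaction mechanism, and the final regression head are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

MODS = ("text", "audio", "vision")

class CrossModalEnhancementSketch(nn.Module):
    """Illustrative sketch of the pipeline described in the abstract
    (not the authors' code): BiLSTM context encoding, six cross-modal
    interaction features, splicing per target modality, and multi-head
    self-attention over original + enhanced unimodal vectors."""

    def __init__(self, d=128, heads=4):
        super().__init__()
        # Assumes all three modalities are pre-projected to dimension d.
        # BiLSTM per modality; d//2 hidden units per direction -> output size d.
        self.bilstm = nn.ModuleDict(
            {m: nn.LSTM(d, d // 2, batch_first=True, bidirectional=True) for m in MODS})
        # One cross-attention block per ordered (source -> target) pair: 6 in total.
        self.cross = nn.ModuleDict(
            {f"{s}2{t}": nn.MultiheadAttention(d, heads, batch_first=True)
             for s in MODS for t in MODS if s != t})
        # Project the two spliced interaction features of each target back to d.
        self.proj = nn.ModuleDict({m: nn.Linear(2 * d, d) for m in MODS})
        # Multi-head self-attention over the 6 tokens: 3 original + 3 enhanced vectors.
        self.selfattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(6 * d, 1)  # hypothetical regression to sentiment intensity

    def forward(self, x):  # x[m]: (batch, seq_len, d) for each modality m
        ctx = {m: self.bilstm[m](x[m])[0] for m in MODS}           # intra-modal context
        pooled = {m: ctx[m].mean(dim=1) for m in MODS}             # (batch, d) summaries
        enhanced = {}
        for t in MODS:
            # Two interaction features whose target modality is t.
            inter = [self.cross[f"{s}2{t}"](ctx[t], ctx[s], ctx[s])[0].mean(dim=1)
                     for s in MODS if s != t]
            enhanced[t] = self.proj[t](torch.cat(inter, dim=-1))   # spliced + projected
        # Stack original and enhanced vectors as a 6-token sequence per sample.
        tokens = torch.stack([pooled[m] for m in MODS] +
                             [enhanced[m] for m in MODS], dim=1)   # (batch, 6, d)
        attended, _ = self.selfattn(tokens, tokens, tokens)        # semantic correlations
        return self.head(attended.flatten(1))                      # (batch, 1) score

# Usage with random features in place of real CMU-MOSI/CMU-MOSEI inputs:
# model = CrossModalEnhancementSketch()
# feats = {m: torch.randn(8, 20, 128) for m in MODS}
# score = model(feats)  # (8, 1)
```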

Keywords