Jisuanji kexue yu tansuo (Mar 2025)

Review on Key Techniques of Video Multimodal Sentiment Analysis

  • DUAN Zongtao, HUANG Junchen, ZHU Xiaole

DOI
https://doi.org/10.3778/j.issn.1673-9418.2404072
Journal volume & issue
Vol. 19, no. 3
pp. 539 – 558

Abstract

Sentiment analysis is the process of automatically determining an opinion holder's attitude or emotional tendency. It is widely used in business, social media analysis, and public opinion monitoring. In unimodal sentiment analysis, most researchers rely on text, facial expressions, or audio information. With the development of deep learning, sentiment analysis has expanded from unimodal to multimodal settings. Combining multiple modalities can compensate for the limitations of any single modality and capture the emotions people express more accurately and comprehensively. This paper summarizes the key techniques of multimodal sentiment analysis, building on three kinds of unimodal sentiment analysis. First, the background and research status of multimodal sentiment analysis are briefly introduced. Second, the commonly used datasets are summarized. Third, unimodal sentiment analysis based on text, facial expressions, and audio information is described. The paper then analyzes the key techniques of video multimodal sentiment analysis, including multimodal fusion, alignment, and modal noise processing, and provides a detailed analysis of the relationships among these techniques and their applications. Next, the performance metrics of different models on three commonly used datasets are compared, further validating the effectiveness of these key techniques. Finally, the existing challenges in multimodal sentiment analysis and future development trends are discussed.

Keywords