IEEE Access (Jan 2020)

Enhanced Video Analytics for Sentiment Analysis Based on Fusing Textual, Auditory and Visual Information

  • Sadam Al-Azani,
  • El-Sayed M. El-Alfy

DOI
https://doi.org/10.1109/ACCESS.2020.3011977
Journal volume & issue
Vol. 8
pp. 136843–136857

Abstract

With the widespread availability of online videos and ongoing digital transformation, video informatics and analytics have gained substantial importance, with impressive success in a variety of tasks such as digital marketing, video surveillance and security systems, healthcare systems, talk-show analysis, analysis of influential groups in social media, and target tracking. This paper evaluates the potential contribution of various video modalities, and how they correlate, for video-based sentiment analysis in the morphologically rich Arabic language. Moreover, an enhanced approach is presented to predict the speaker's sentiment in multi-dialect Arabic through the integration of the textual, auditory, and visual modalities. Different features are extracted to represent each modality: prosodic and spectral acoustic features for audio, neural word embeddings for the audio transcript, and dense optical-flow descriptors for the visual modality. The extracted features are first used individually to train two machine learning classifiers as a baseline. Then, the effectiveness of various combinations of modalities is verified using multi-level fusion (feature, score, and decision). The experimental results demonstrate that the proposed approach of combining the different modalities leads to more accurate prediction of the speaker's sentiment, with accuracy above 94%.
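To make the three fusion levels mentioned in the abstract concrete, the following Python sketch contrasts feature-level (concatenating modality vectors before training one classifier), score-level (averaging per-class probabilities), and decision-level (majority voting over hard labels) fusion. The classifier choice (scikit-learn logistic regression), binary sentiment labels, and all function names are illustrative assumptions, not the authors' implementation.

# Hedged sketch of multi-level fusion; feature arrays, the classifier,
# and binary labels are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_fusion(text_feats, audio_feats, visual_feats, labels):
    """Feature-level fusion: concatenate modality vectors, train one classifier."""
    X = np.hstack([text_feats, audio_feats, visual_feats])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def score_fusion(classifiers, feats_per_modality):
    """Score-level fusion: average per-class probabilities across modalities."""
    probs = [clf.predict_proba(X) for clf, X in zip(classifiers, feats_per_modality)]
    return np.mean(probs, axis=0).argmax(axis=1)

def decision_fusion(classifiers, feats_per_modality):
    """Decision-level fusion: majority vote over per-modality hard labels."""
    votes = np.stack([clf.predict(X) for clf, X in zip(classifiers, feats_per_modality)])
    # Majority vote per sample (binary sentiment assumed: 0 = negative, 1 = positive).
    return (votes.mean(axis=0) >= 0.5).astype(int)

A practical difference between the levels: feature fusion lets the classifier learn cross-modal interactions but requires aligned samples across all modalities, whereas score and decision fusion keep per-modality classifiers independent and degrade more gracefully when one modality is noisy or missing.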

Keywords