On the Utilization of Emoji Encoding and Data Preprocessing with a Combined CNN-LSTM Framework for Arabic Sentiment Analysis

Hussam Alawneh; Ahmad Hasasneh; Mohammed Maree

doi:10.3390/modelling5040076

Modelling (Oct 2024)

Affiliations

Hussam Alawneh: Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine
Ahmad Hasasneh: Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine
Mohammed Maree: Department of Information Technology, Arab American University, Ramallah P.O. Box 240, Palestine

DOI: https://doi.org/10.3390/modelling5040076
Journal volume & issue: Vol. 5, no. 4, pp. 1469–1489

Abstract

Social media users often express their emotions through text in posts and tweets, which can be used for sentiment analysis, that is, classifying text as positive or negative. Sentiment analysis is important in fields such as politics, tourism, e-commerce, education, and health. However, approaches that perform well on English text face challenges with Arabic text because of its morphological complexity. Effective data preprocessing and machine learning techniques are essential to overcome these challenges and provide insightful sentiment predictions for Arabic text. This paper evaluates a combined CNN-LSTM framework with emoji encoding for Arabic sentiment analysis, using the Arabic Sentiment Twitter Corpus (ASTC) dataset. Three experiments were conducted with eight-parameter fusion approaches to evaluate the effect of data preprocessing, in particular the effect of encoding emojis according to their real and emotional meanings. Emoji meanings were collected from four websites that specialize in explaining the meaning of emojis in social media. Furthermore, the Keras tuner was used to optimize the CNN-LSTM parameters during 5-fold cross-validation. The highest accuracy (91.85%) was achieved by keeping non-Arabic words, removing punctuation, applying the Snowball stemmer after encoding emojis into Arabic text, and using Keras embedding. This approach is competitive with other state-of-the-art approaches. The results indicate that emoji encoding enriches the text by accurately reflecting emotions and, together with the investigated preprocessing, allows the hybrid model to achieve results comparable to those of a previous study on the same ASTC dataset, thereby improving sentiment analysis accuracy.
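The emoji-encoding and punctuation-removal steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `EMOJI_MEANINGS` table, the `encode_emojis` and `preprocess` function names, and the chosen Arabic glosses are all assumptions made for illustration, whereas the paper collected emoji meanings from four websites. Stemming, embedding, and the CNN-LSTM model itself are left out.

```python
import string

# Hypothetical emoji-to-Arabic lookup, for illustration only. The paper's
# actual emoji meanings (gathered from four websites) are not reproduced here.
EMOJI_MEANINGS = {
    "\U0001F602": "ضحك",  # face with tears of joy -> "laughter"
    "\U0001F622": "حزن",  # crying face -> "sadness"
    "\u2764": "حب",       # red heart -> "love"
}

# ASCII punctuation plus a few common Arabic punctuation marks (assumed set).
PUNCTUATION = string.punctuation + "،؛؟"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def encode_emojis(text, meanings=EMOJI_MEANINGS):
    """Replace each known emoji with its Arabic gloss, padded by spaces."""
    for emoji, meaning in meanings.items():
        text = text.replace(emoji, f" {meaning} ")
    return text


def preprocess(text):
    """Encode emojis, drop punctuation, and normalize whitespace.

    Non-Arabic words are kept, matching the best configuration
    reported in the abstract.
    """
    text = encode_emojis(text)
    text = text.replace("\ufe0f", "")    # drop emoji variation selector
    text = text.translate(_PUNCT_TABLE)  # punctuation -> spaces
    return " ".join(text.split())
```

For example, `preprocess("الفيلم رائع!! \U0001F602 great")` returns the original words with the emoji replaced by its Arabic gloss and the exclamation marks removed, while the English word `great` is kept.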

Published in Modelling

ISSN: 2673-3951 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General): Engineering design
Website: https://www.mdpi.com/journal/modelling
