Enhancing Emoji-Based Sentiment Classification in Urdu Tweets: Fusion Strategies With Multilingual BERT and Emoji Embeddings

Komal Rani Narejo; Hongying Zan; Dina Oralbekova; Kheem Parkash Dharmani; Mamyrbayev Orken; Kuralai Mukhsina

doi:10.1109/ACCESS.2024.3446897

IEEE Access (Jan 2024)

Affiliations

Komal Rani Narejo: School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
Hongying Zan: School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
Dina Oralbekova: Institute of Information and Computational Technologies, Almaty, Kazakhstan
Kheem Parkash Dharmani: School of Computing, National University of Computer and Emerging Sciences, Islamabad, Pakistan
Mamyrbayev Orken: Institute of Information and Computational Technologies, Almaty, Kazakhstan
Kuralai Mukhsina: Institute of Information and Computational Technologies, Almaty, Kazakhstan

DOI: https://doi.org/10.1109/ACCESS.2024.3446897
Journal volume & issue: Vol. 12
pp. 126587 – 126600

Abstract

X (formerly known as Twitter) is a popular social network with hundreds of millions of users. We show how emojis can improve the detection of user sentiment. Our objective is to analyze the sentiment expressed in Urdu-language tweets, a task made difficult by the language’s intricate structure and diverse dialects. We combine emoji embeddings with the SentiUrdu-1M dataset, which consists of 1.14 million Urdu tweets and 1,194 emojis, using multilingual BERT (mBERT). The study has two aims: 1) to evaluate pre-trained emoji2vec embeddings and our proposed Urdu-Specific FastText emoji embeddings in terms of their ability to distinguish emojis by the expressions they convey; and 2) to explore techniques for integrating Urdu tweet representations with emoji embeddings, namely concatenation, neural network fusion, and attention mechanism fusion. As baselines, we fine-tuned multilingual BERT and XLM-RoBERTa on text-only Urdu tweets, achieving accuracies of 64% and 65%, respectively. The study addresses a gap in the literature: the use of emojis to enhance sentiment analysis of Urdu-language tweets has received limited attention. The Urdu-Specific FastText emoji embeddings proposed in this paper yield better results than the pre-trained emoji2vec embeddings and raise sentiment analysis accuracy to as much as 95% with the neural network fusion approach.
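The three integration techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 4-dimensional vectors, the random layer weights, and the choice of scaled dot-product attention with the text vector as the query are all assumptions made for illustration. In a real system the text vector would come from an encoder such as mBERT and the emoji vectors from emoji2vec or FastText, and a projection would be needed wherever their dimensions differ.

```python
import math
import random


def concat_fusion(text_vec, emoji_vec):
    """Concatenation: place the two vectors side by side."""
    return list(text_vec) + list(emoji_vec)


def dense(vec, weights, bias):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def nn_fusion(text_vec, emoji_vec, weights, bias):
    """Neural network fusion (assumed form): dense layer + ReLU over the concatenation."""
    hidden = dense(concat_fusion(text_vec, emoji_vec), weights, bias)
    return [max(0.0, h) for h in hidden]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(text_vec, emoji_vecs):
    """Attention fusion (assumed form): the text vector queries the emoji vectors.

    All vectors must share one dimension in this simplified version.
    Returns the fused vector and the attention weights.
    """
    dim = len(text_vec)
    scores = [sum(q * k for q, k in zip(text_vec, e)) / math.sqrt(dim)
              for e in emoji_vecs]
    attn = softmax(scores)
    pooled = [sum(a * e[i] for a, e in zip(attn, emoji_vecs))
              for i in range(dim)]
    return list(text_vec) + pooled, attn


# Hypothetical toy inputs standing in for real sentence and emoji embeddings.
text_vec = [0.1, 0.2, 0.3, 0.4]
emoji_vecs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# Random, untrained layer parameters: 3 hidden units over an 8-dim input.
random.seed(0)
weights = [[random.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

concat_out = concat_fusion(text_vec, emoji_vecs[0])
nn_out = nn_fusion(text_vec, emoji_vecs[0], weights, bias)
attn_out, attn_weights = attention_fusion(text_vec, emoji_vecs)

print(len(concat_out), len(nn_out), len(attn_out))
print(attn_weights)
```

In each case the fused vector would then be passed to a sentiment classifier. The sketch only shows how the two sources of information are merged, not how the fusion layers are trained.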

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
