IEEE Access (Jan 2024)

Enhancing Emoji-Based Sentiment Classification in Urdu Tweets: Fusion Strategies With Multilingual BERT and Emoji Embeddings

  • Komal Rani Narejo,
  • Hongying Zan,
  • Dina Oralbekova,
  • Kheem Parkash Dharmani,
  • Mamyrbayev Orken,
  • Kuralai Mukhsina

DOI
https://doi.org/10.1109/ACCESS.2024.3446897
Journal volume & issue
Vol. 12
pp. 126587 – 126600

Abstract

X (formerly known as Twitter) is a popular social network with hundreds of millions of users. This paper examines how emojis can improve the detection of user sentiment. Our objective is to analyze the sentiment expressed in Urdu-language tweets, a demanding task because of the language’s intricate structure and diverse dialects. We combine emoji embeddings with tweet text from the SentiUrdu-1M dataset, which consists of 1.14 million Urdu tweets and 1,194 emojis, using multilingual BERT (mBERT). The study has two aims: 1) to evaluate pre-trained emoji2vec embeddings and our proposed Urdu-Specific FastText emoji embeddings in terms of their ability to distinguish emojis based on their expressions; and 2) to explore techniques for integrating Urdu tweet and emoji embeddings, including concatenation, neural network fusion, and attention mechanism fusion. As baselines, we fine-tuned multilingual BERT and XLM-RoBERTa on text-only Urdu tweets, achieving accuracies of 64% and 65%, respectively. The study thus addresses a gap in the literature: the use of emojis to enhance sentiment analysis of Urdu tweets has received limited attention. The proposed Urdu-Specific FastText emoji embeddings yield better results than the pre-trained emoji2vec embeddings and raise sentiment analysis accuracy to as much as 95% with the neural network fusion approach.
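The three integration techniques named above can be illustrated with a minimal NumPy sketch. The abstract does not specify the layer sizes, activations, or attention formulation used in the paper, so everything below is an assumption made for illustration: the text vector stands in for a 768-dimensional mBERT sentence representation, the emoji vector for a 300-dimensional emoji embedding, and `FUSED_DIM`, the random weights, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_DIM = 768    # hidden size of mBERT base
EMOJI_DIM = 300   # assumed emoji embedding size
FUSED_DIM = 256   # hypothetical size of the fused representation
BATCH = 4         # hypothetical batch of tweets


def concat_fusion(text_vec, emoji_vec):
    """Concatenation: place the two vectors side by side."""
    return np.concatenate([text_vec, emoji_vec], axis=-1)


def nn_fusion(text_vec, emoji_vec, W, b):
    """Neural network fusion (assumed form): one dense ReLU layer
    applied to the concatenated vectors."""
    joint = np.concatenate([text_vec, emoji_vec], axis=-1)
    return np.maximum(0.0, joint @ W + b)


def attention_fusion(text_vec, emoji_vec, W_t, W_e, q):
    """Attention fusion (assumed form): project both modalities into a
    shared space, score each with a query vector, and return the
    softmax-weighted sum together with the weights."""
    cands = np.stack([text_vec @ W_t, emoji_vec @ W_e], axis=1)  # (B, 2, D)
    scores = np.tanh(cands) @ q                                  # (B, 2)
    scores = scores - scores.max(axis=1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = (weights[..., None] * cands).sum(axis=1)             # (B, D)
    return fused, weights


# Random stand-ins for real mBERT and emoji embeddings.
text_vec = rng.normal(size=(BATCH, TEXT_DIM))
emoji_vec = rng.normal(size=(BATCH, EMOJI_DIM))

# Randomly initialised parameters; in practice these would be learned.
W = rng.normal(scale=0.02, size=(TEXT_DIM + EMOJI_DIM, FUSED_DIM))
b = np.zeros(FUSED_DIM)
W_t = rng.normal(scale=0.02, size=(TEXT_DIM, FUSED_DIM))
W_e = rng.normal(scale=0.02, size=(EMOJI_DIM, FUSED_DIM))
q = rng.normal(scale=0.02, size=FUSED_DIM)

fused_concat = concat_fusion(text_vec, emoji_vec)
fused_nn = nn_fusion(text_vec, emoji_vec, W, b)
fused_attn, attn_weights = attention_fusion(text_vec, emoji_vec, W_t, W_e, q)
```

In each case the fused vector would then feed a sentiment classification head. Concatenation keeps both representations unchanged, the dense layer learns a joint projection, and the attention variant learns per-tweet weights for text versus emoji.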

Keywords