IEEE Access (Jan 2024)
Revolutionizing Urdu Sentiment Analysis: Harnessing the Power of XLM-R and GPT-2
Abstract
Sentiment analysis extracts insights from textual sources using computational and systematic text analysis and natural language processing. It identifies and measures the attitudes, beliefs, and emotional states that individuals express in text. Recent research on sentiment analysis has focused largely on English, so low-resource languages have received far less attention. Sentiment analysis for low-resource languages is difficult because large datasets and related repositories are unavailable. To address this issue, this paper creates a new dataset for a low-resource language, Urdu. The dataset, named LUCSA-23, consists of more than 65,000 user reviews from various genres, including food, sports, showbiz, apps, and political reviews, collected from Pakistan, a developing country. Urdu domain experts annotated the dataset. This paper also proposes an Urdu sentiment analysis approach that leverages the transformer models XLM-R and GPT-2. The approach preprocesses the Urdu text input, generates BERT embeddings, and passes them to the proposed classifier for sentiment classification. The proposed classifier is compared with machine learning, deep learning, and embedded classifiers to evaluate its performance. The findings show that the proposed classifier outperforms existing state-of-the-art approaches, achieving an accuracy of 95%.
Keywords