Data in Brief (Dec 2024)

A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis

Mendeley Data

  • Mudasar Ahmed Soomro,
  • Rafia Naz Memon,
  • Asghar Ali Chandio,
  • Mehwish Leghari,
  • Muhammad Hanif Soomro

Journal volume & issue
Vol. 57
p. 111170

Abstract

Roman Urdu text is widespread on many websites. People often prefer to post their social comments or product reviews in Roman Urdu, which is regarded as a non-standard language. The main reason is that Roman Urdu has no spelling rules, so people create and post their own spellings, much as “2mro” is a nonstandard spelling of “tomorrow”. This paper presents two Roman Urdu datasets. The first is a collection of Roman Urdu words with spelling variations: it contains 5244 Roman Urdu words and includes from one to five different spellings for each word. The second consists of Roman Urdu reviews collected from seven different internet-based sources. It contains multiclass reviews labeled “very positive,” “positive,” “very negative,” “negative,” or “neutral”. A total of 28,090 reviews were gathered. The sentiment labels were assigned by domain experts familiar with the Urdu language.

Keywords