A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis

Mudasar Ahmed Soomro; Rafia Naz Memon; Asghar Ali Chandio; Mehwish Leghari; Muhammad Hanif Soomro

Data in Brief (Dec 2024)

Affiliations

Mudasar Ahmed Soomro: Department of Information Technology, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan; Corresponding author.
Rafia Naz Memon: Department of Software Engineering, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan
Asghar Ali Chandio: Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan; School of Engineering and Information Technology, The University of New South Wales, Australia
Mehwish Leghari: Department of Data Science, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan
Muhammad Hanif Soomro: Department of Information Technology, University of Sindh, Jamshoro, Pakistan

Journal volume & issue: Vol. 57, p. 111170

Abstract

Roman Urdu text is widespread on the web. Many people prefer to write their social comments and product reviews in Roman Urdu, yet it is regarded as a non-standard language. The main reason is that Roman Urdu has no spelling rules, so people create and post their own spellings; for example, “2mro” is a non-standard spelling of “tomorrow”. This paper presents two Roman Urdu datasets. The first contains 5244 Roman Urdu words and includes between one and five different spellings for each word. The second consists of 28,090 Roman Urdu reviews collected from seven internet-based sources. Each review is labelled with one of five sentiment classes: “very positive,” “positive,” “neutral,” “negative,” or “very negative”. The sentiment labels were assigned by domain experts familiar with the Urdu language.
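
As a rough illustration of how a spelling-variation list like the first dataset might be used alongside the five review classes, the sketch below maps variant spellings to a single canonical form before any sentiment model sees the text. Everything in it is an assumption made for illustration: the dictionary layout, the function names `build_lookup` and `normalize`, and all sample entries other than “2mro” are hypothetical and are not taken from the published files.

```python
import re

# Hypothetical in-memory lexicon: canonical form -> known spelling variants.
# Only "2mro" comes from the abstract; the other entries are illustrative
# and are NOT drawn from the published dataset.
VARIANTS = {
    "tomorrow": ["2mro"],
    "acha": ["achha", "accha"],
}

# The five sentiment classes named in the abstract.
LABELS = ("very negative", "negative", "neutral", "positive", "very positive")


def build_lookup(variants):
    """Invert the lexicon so that every spelling points to its canonical form."""
    lookup = {}
    for canonical, spellings in variants.items():
        key = canonical.lower()
        lookup[key] = key
        for spelling in spellings:
            lookup[spelling.lower()] = key
    return lookup


def normalize(sentence, lookup):
    """Lowercase, split into word tokens, and replace known variants."""
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(lookup.get(token, token) for token in tokens)


lookup = build_lookup(VARIANTS)
normalized = normalize("Movie achha tha, 2mro dekhenge", lookup)
print(normalized)
```

A dictionary lookup is only one possible way to handle spelling variation; it is shown here because it needs nothing beyond the word list itself. Words that are missing from the lexicon pass through unchanged.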

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
