Data in Brief (Feb 2024)

Dataset construction to detect human behavior with the help of emotions, sentiments and mood for Roman Urdu

  • Asia Samreen,
  • Syed Asif Ali

Journal volume & issue
Vol. 52
p. 109906

Abstract

Roman Urdu and English are often used together as a hybrid language on social media. Because writers pay little attention to spelling when they use the English alphabet to write Urdu while texting, interpreting emotions in such code-mixed text is challenging. This dataset contains over 14,000 emotion lexicon entries, each listing nine emotions and their polarities. The NRC emotion lexicons [8] provided in Urdu were transliterated into Roman Urdu with a Python script that converts words from Urdu to Roman Urdu, and three online Urdu dictionaries were used to verify that the provided translations are accurate. Sentiment and mood labels, which depend on the emotion lexicon, are also provided. The textual data were annotated using unigram features and distance estimation between text strings and lexicon entries. Approximately 10,000 sentences from the baseline sample were annotated automatically.
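The annotation step described in the abstract (unigram features plus a distance estimate between strings and lexicon entries) can be illustrated with a minimal sketch. This is not the authors' script: the abstract does not say which distance measure was used, so Levenshtein edit distance is assumed here, and the toy lexicon entries, the `max_ratio` threshold, and all function names are hypothetical choices made for illustration only.

```python
from collections import Counter

# Hypothetical toy lexicon (Roman Urdu word -> emotion label).
# These entries are illustrative and are NOT taken from the dataset.
TOY_LEXICON = {
    "khushi": "joy",
    "gussa": "anger",
    "dar": "fear",
    "udaas": "sadness",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def closest_entry(token, lexicon, max_ratio=0.34):
    """Return the lexicon word nearest to `token`, or None if too distant.

    The distance is normalized by the longer string's length; the
    `max_ratio` cutoff is an arbitrary assumption, not a published value.
    """
    best_word, best_ratio = None, None
    for word in lexicon:
        ratio = edit_distance(token, word) / max(len(token), len(word))
        if best_ratio is None or ratio < best_ratio:
            best_word, best_ratio = word, ratio
    if best_ratio is not None and best_ratio <= max_ratio:
        return best_word
    return None


def annotate(sentence, lexicon=TOY_LEXICON):
    """Count emotion labels over the sentence's unigrams (whitespace tokens)."""
    counts = Counter()
    for token in sentence.lower().split():
        match = closest_entry(token, lexicon)
        if match is not None:
            counts[lexicon[match]] += 1
    return dict(counts)
```

With this sketch, a spelling variant such as "khushee" still maps to the toy entry "khushi", which is the kind of tolerance to inconsistent Roman Urdu spelling that distance-based matching is meant to provide. A real pipeline would need a tuned threshold and a proper tokenizer.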

Keywords