Data in Brief (Dec 2024)
A Kurdish Sorani Twitter dataset for language modelling
Abstract
Sentiment analysis is the task of extracting, identifying, characterizing, and classifying the attitudes and opinions that individuals express in text. While many other languages have extensive datasets for this task, very few sentiment analysis datasets exist for Kurdish, which makes building such datasets necessary to support further work on the language. This paper presents a Twitter dataset of 24,668 tweets retained from an initial sample of 30,009 texts. Human annotators labelled each tweet for subjectivity, sentiment, offensiveness, and target. After the initial annotation, an independent reviewer examined all labelled data to ensure the robustness of the dataset. The cleaned dataset contains 8772 subjective and 15,896 non-subjective tweets. By sentiment, 12,938 tweets are negative, 3189 are neutral, and 8541 are positive. By offensiveness, 22,436 tweets are non-offensive and 2232 are offensive. The dataset also distinguishes targeted from non-targeted tweets: 22,436 tweets are not aimed at a specific individual or entity, and 2232 are directed at a particular target. The dataset is intended as a resource for researchers building state-of-the-art models for the Kurdish language.
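The reported label counts can be checked for internal consistency with a few lines of Python. The sketch below is illustrative only and is not part of the dataset's tooling: the dictionary layout and the names `TOTAL`, `counts`, and `shares` are assumptions introduced here, while the numbers are taken directly from the abstract. It confirms that each annotation dimension sums to the 24,668 retained tweets and prints each label's share.

```python
# Label counts as reported in the abstract. The dictionary layout and all
# names below are hypothetical; only the numbers come from the paper.
TOTAL = 24_668
counts = {
    "subjectivity": {"subjective": 8_772, "non-subjective": 15_896},
    "sentiment": {"negative": 12_938, "neutral": 3_189, "positive": 8_541},
    "offensiveness": {"non-offensive": 22_436, "offensive": 2_232},
    "target": {"non-targeted": 22_436, "targeted": 2_232},
}


def shares(label_counts):
    """Return each label's fraction of its dimension's total."""
    total = sum(label_counts.values())
    return {label: n / total for label, n in label_counts.items()}


for dimension, label_counts in counts.items():
    # Every dimension should account for all retained tweets.
    assert sum(label_counts.values()) == TOTAL, dimension
    for label, share in shares(label_counts).items():
        print(f"{dimension:14s} {label:15s} {share:6.1%}")
```

A check of this kind makes class imbalance visible before any model is trained, for example the small proportion of offensive tweets relative to non-offensive ones.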