Kurdish News Dataset Headlines (KNDH) through multiclass classification

Soran Badawi; Ari M. Saeed; Sara A. Ahmed; Peshraw Ahmed Abdalla; Diyari A. Hassan

Data in Brief (Jun 2023)

Affiliations

Soran Badawi: Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq
Ari M. Saeed (corresponding author): Computer Science Department, University of Halabja, KRG, Halabja, Kurdistan, Iraq
Sara A. Ahmed: Department of Computer Science, Komar University of Science and Technology, Sulaymaniyah, Kurdistan Region, Iraq
Peshraw Ahmed Abdalla: Computer Science Department, University of Halabja, KRG, Halabja, Kurdistan, Iraq
Diyari A. Hassan: Faculty of Engineering & Computer Science, Qaiwan International University, Sulaymaniyah, Kurdistan Region-Iraq

Journal volume & issue: Vol. 48, p. 109120

Abstract

The rapid growth of technology has massively increased the amount of available text data. This data can be mined and used for numerous natural language processing (NLP) tasks, particularly text classification. A core step in text classification is collecting the data needed to train a good model. This paper presents the Kurdish News Dataset Headlines (KNDH), collected for text classification. The dataset consists of 50,000 news headlines distributed equally among five classes (Social, Sport, Health, Economic, and Technology), with 10,000 headlines per class. The share of headlines drawn from each channel differs, while the number of samples is the same for every category. The headlines were collected from 34 distinct channels, some of which contribute to more than one class: 8 channels for economics, 14 channels for health, 18 channels for science, 15 channels for social, and 5 channels for sport. The dataset is preprocessed using the Kurdish Language Processing Toolkit (KLPT) for tokenizing, spell-checking, stemming, and preprocessing.
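
The figures in the abstract can be checked with a few lines of Python. The sketch below is illustrative only: it rebuilds the class sizes and per-class channel counts exactly as stated above, then shows one possible way to make a class-balanced train/test split. The `stratified_split` helper, the 20% test fraction, and the seed are assumptions made for this example and are not taken from the paper; the labels are synthetic placeholders, not the actual headlines.

```python
import random
from collections import Counter, defaultdict

# Figures as stated in the abstract.
CLASS_SIZES = {"Social": 10000, "Sport": 10000, "Health": 10000,
               "Economic": 10000, "Technology": 10000}
CHANNELS_PER_CLASS = {"economics": 8, "health": 14, "science": 18,
                      "social": 15, "sport": 5}
DISTINCT_CHANNELS = 34

total_headlines = sum(CLASS_SIZES.values())
# Each (class, channel) pairing is counted once per class.
total_pairings = sum(CHANNELS_PER_CLASS.values())
# Pairings beyond the number of distinct channels must reuse a channel
# that already serves another class.
reused_pairings = total_pairings - DISTINCT_CHANNELS


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: split indices so every class keeps the same
    share in the train and test parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx


# Synthetic label list standing in for the real dataset.
labels = [name for name, n in CLASS_SIZES.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

print(total_headlines, total_pairings, reused_pairings)
print(dict(test_counts))
```

Because the per-class channel counts sum to more than the 34 distinct channels, at least some channels necessarily supply headlines to several classes. With equal class sizes, a stratified split keeps the test set balanced as well, so plain accuracy remains a meaningful metric for a baseline classifier.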

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
