Investigating The Efficiency Of Machine Learning Methods In Authorship Detection For Low-Resourced Languages: The Case Of Kurdish Authors

Saia Hasan; Hossein Hassani

doi:10.23918/eajse.v9i2p14

Eurasian Journal of Science and Engineering (Jun 2023)

Affiliations

Saia Hasan: Computer Science and Engineering, University of Kurdistan Hewlêr, Kurdistan Region, Iraq
Hossein Hassani: Computer Science and Engineering, University of Kurdistan Hewlêr, Kurdistan Region, Iraq

Journal volume & issue: Vol. 9, no. 2
pp. 178 – 194

Abstract

Textual data continues to multiply over time, and alongside this exponential growth the amount of anonymous material has also increased. Authorship detection has significant potential in numerous applications of authorship analysis, such as history and literary science, forensic examination, and plagiarism detection. For this study, we manually collected 2798 documents by 150 authors to investigate how effectively existing machine learning algorithms can identify the Kurdish authors of unattributed texts. The proposed approach extracts token n-gram frequencies, with n ranging from 1 to 5, as features and weights each token with TF-IDF to find a pattern in each author's text. We train SVM, CNB, MNB, and K-NN classifiers on a collection of documents with known authors, on the premise that the most informative tokens of an unknown document resemble those of known documents by the same author. Each trained classifier is then given an unseen document and assesses how closely it resembles the known documents. SVM achieved an accuracy of 80% with both the one-vs-one (O-V-O) and one-vs-rest (O-V-R) approaches on token 1-grams, along with promising results in precision, recall, and F1-score. Furthermore, to our knowledge, this is the first study to investigate authorship detection for the Kurdish language.
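The pipeline described in the abstract (token n-gram features, TF-IDF weighting, then a classifier over documents with known authors) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, a smoothed IDF formula chosen for illustration, whitespace tokenization, and a cosine-similarity K-NN vote standing in for the four classifiers compared in the paper. The function names, the toy English sentences, and the labels `author_a` and `author_b` are all hypothetical placeholders, not data from the study.

```python
import math
from collections import Counter


def token_ngrams(text, n_min=1, n_max=1):
    """Return whitespace-token n-grams for every n in [n_min, n_max]."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_idf(docs_grams):
    """Smoothed inverse document frequency (an assumed formula) per n-gram."""
    n_docs = len(docs_grams)
    df = Counter()
    for grams in docs_grams:
        df.update(set(grams))  # count each n-gram once per document
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(grams, idf):
    """L2-normalised TF-IDF weights; n-grams unseen in training are dropped."""
    tf = Counter(g for g in grams if g in idf)
    vec = {g: count * idf[g] for g, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse, already L2-normalised vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def predict_author(train, text, k=1, n_min=1, n_max=1):
    """Majority vote among the k training documents most similar to `text`."""
    grams = [token_ngrams(doc, n_min, n_max) for doc, _ in train]
    idf = fit_idf(grams)
    vectors = [tfidf_vector(g, idf) for g in grams]
    query = tfidf_vector(token_ngrams(text, n_min, n_max), idf)
    scored = sorted(
        ((cosine(query, v), author) for v, (_, author) in zip(vectors, train)),
        reverse=True,
    )
    votes = Counter(author for _, author in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (document text, author label).
train = [
    ("the river runs cold through the valley at dawn", "author_a"),
    ("cold river water and the quiet valley mist", "author_a"),
    ("markets rallied as investors priced in rate cuts", "author_b"),
    ("investors expect the central bank to hold rates", "author_b"),
]

print(predict_author(train, "mist over the cold river"))
print(predict_author(train, "investors and the bank rates"))
```

Setting `n_min=1, n_max=1` corresponds to the token 1-gram setting for which the abstract reports its best accuracy; raising `n_max` toward 5 adds longer n-grams as extra features. A real experiment would also need a held-out test split and per-class precision, recall, and F1-score, which this sketch omits.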

Published in Eurasian Journal of Science and Engineering

ISSN: 2414-5629 (Print); 2414-5602 (Online)
Publisher: Tishk International University
Country of publisher: Iraq
LCC subjects: Science
Website: https://eajse.tiu.edu.iq/
