Active Learning for News Article’s Authorship Identification

Sidra Abbas; Shtwai Alsubai; Gabriel Avelino Sampedro; Mideth Abisado; Ahmad S. Almadhor; Natalia Kryvinska; Monji Mohamed Zaidi

doi:10.1109/ACCESS.2023.3310813

IEEE Access (Jan 2023)

Affiliations

Sidra Abbas: Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
Shtwai Alsubai: College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
Gabriel Avelino Sampedro: Faculty of Information and Communication Studies, University of the Philippines Open University, Los Baños, Philippines
Mideth Abisado: College of Computing and Information Technologies, National University, Manila, Philippines
Ahmad S. Almadhor: Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia
Natalia Kryvinska: Information Systems Department, Faculty of Management, Comenius University Bratislava, Bratislava, Slovakia
Monji Mohamed Zaidi: Department of Electrical Engineering, College of Engineering, King Khalid University, Abha, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2023.3310813
Journal volume & issue: Vol. 11, pp. 98415–98426

Abstract

Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying the authors of unknown texts. To this end, a system has to be developed that accurately determines the author of an unknown text, given a group of writing samples. Active learning is utilized in this study because it iteratively selects the most informative samples to include in the training set, which enables a more precise and accurate authorship identification approach with fewer examples. This makes it useful for analyzing the rising amount of anonymous content and identifying the authors of unknown texts. This study proposes a novel approach that utilizes active learning (AL) based machine learning models, namely Logistic Regression (AL-LR), Random Forest (AL-RF), XGBoost (AL-XGB), and Multilayer Perceptron (AL-MLP), for authorship identification. The proposed approach extracts valuable characteristics of the writer using Term Frequency-Inverse Document Frequency (TF-IDF). The comprehensive dataset selected for this study, “All the news,” is divided into three subsets: Article 1, Article 2, and Article 3. We have restricted the dataset’s scope and selected the top 50 authors for our experimentation. The experimental outcomes reveal that the proposed AL-XGB model achieves superior performance on Article 1 of the “All the news” dataset. Further, the AL-LR model performed well on Article 2, and the AL-MLP model performed well on Article 3. The results suggest using the proposed approach for authorship identification.
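The loop the abstract describes (extract TF-IDF features, train on a small labeled set, then repeatedly add the most informative unlabeled sample) can be illustrated with a minimal, standard-library-only Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the abstract does not name a query strategy, so margin-based uncertainty sampling is used; a nearest-centroid classifier with cosine similarity stands in for the AL-LR, AL-RF, AL-XGB, and AL-MLP models; and the six toy sentences, the author labels, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF dict per non-empty document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no term gets weight 0.
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def fit_centroids(vectors, labels, labeled_idx):
    """Sum the labeled vectors per author (cosine ignores the scale)."""
    sums = defaultdict(lambda: defaultdict(float))
    for i in labeled_idx:
        for t, w in vectors[i].items():
            sums[labels[i]][t] += w
    return {author: dict(vec) for author, vec in sums.items()}

def ranked_scores(vec, centroids):
    """Cosine similarity of a unit-length vec to each centroid, best first."""
    scores = []
    for author, c in centroids.items():
        c_norm = math.sqrt(sum(w * w for w in c.values())) or 1.0
        dot = sum(w * c.get(t, 0.0) for t, w in vec.items())
        scores.append((dot / c_norm, author))
    return sorted(scores, reverse=True)

def margin(vec, centroids):
    """Gap between the two best authors; a small gap means an uncertain sample.
    Assumes the labeled set already covers at least two authors."""
    ranked = ranked_scores(vec, centroids)
    return ranked[0][0] - ranked[1][0]

def predict(vec, centroids):
    return ranked_scores(vec, centroids)[0][1]

def active_learning(vectors, oracle_labels, seed_idx, n_queries):
    """Pool-based loop: refit, query the least certain sample, label it."""
    labeled = list(seed_idx)
    pool = [i for i in range(len(vectors)) if i not in labeled]
    for _ in range(n_queries):
        if not pool:
            break
        centroids = fit_centroids(vectors, oracle_labels, labeled)
        query = min(pool, key=lambda i: margin(vectors[i], centroids))
        labeled.append(query)   # the "oracle" reveals oracle_labels[query]
        pool.remove(query)
    return fit_centroids(vectors, oracle_labels, labeled), labeled

# Hypothetical toy corpus: two authors with different beats.
docs = [
    "the senate passed the budget bill on tuesday",
    "the striker scored twice in the second half",
    "lawmakers debated the budget in the senate",
    "the goalkeeper saved a penalty in the second half",
    "the senate vote on the bill was delayed",
    "the striker missed a penalty late in the match",
]
authors = ["author_a", "author_b", "author_a", "author_b", "author_a", "author_b"]

vectors = tfidf_vectors(docs)
centroids, labeled = active_learning(vectors, authors, seed_idx=[0, 1], n_queries=2)
print("labeled indices:", labeled)
print("prediction for doc 0:", predict(vectors[0], centroids))
```

In a realistic setting the oracle would be a human annotator or held-back ground truth, and the stand-in classifier would be replaced by whichever model is being evaluated; only the query step (choosing the pool item the current model is least sure about) is the part specific to active learning.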

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
