IEEE Access (Jan 2023)

Active Learning for News Article’s Authorship Identification

  • Sidra Abbas,
  • Shtwai Alsubai,
  • Gabriel Avelino Sampedro,
  • Mideth Abisado,
  • Ahmad S. Almadhor,
  • Natalia Kryvinska,
  • Monji Mohamed Zaidi

DOI
https://doi.org/10.1109/ACCESS.2023.3310813
Journal volume & issue
Vol. 11
pp. 98415 – 98426

Abstract

Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying the authors of unknown texts. To this end, a system has to be developed that accurately determines the author of an unknown text, given a group of writing samples. Active Learning is utilized in this study because it iteratively selects the most informative samples to include in the training set, which enables a more precise and accurate authorship identification approach with fewer examples. This makes it useful for analyzing the rising amount of anonymous content and identifying the authors of unknown texts. This study proposes a novel approach that utilizes active learning (AL)-based machine learning models, namely Logistic Regression (AL-LR), Random Forest (AL-RF), XGBoost (AL-XGB), and Multilayer Perceptron (AL-MLP), for authorship identification. The proposed approach extracts valuable characteristics of the writer using Term Frequency-Inverse Document Frequency (TF-IDF). The comprehensive dataset selected for this study, “All the news,” is divided into three subsets: Article 1, Article 2, and Article 3. We have restricted the dataset’s scope and selected the top 50 authors for our experimentation. The experimental outcomes reveal that the proposed AL-XGB model achieves superior performance on Article 1 of the “All the news” dataset. Further, the AL-LR model performed well on Article 2, and the AL-MLP model performed well on Article 3. The results support the use of the proposed approach for authorship identification.
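As a rough illustration of the idea described in the abstract (TF-IDF features plus a loop that repeatedly asks for the label of the most informative unlabeled article), the following is a minimal sketch and not the authors' implementation. Several things here are assumptions: the abstract does not state the query strategy, so smallest-margin uncertainty sampling is used; a simple nearest-centroid classifier stands in for the paper's AL-LR, AL-RF, AL-XGB, and AL-MLP models so that the example needs only the Python standard library; and the six toy sentences, the author names, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF dict per document (smoothed IDF)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def fit_centroids(vectors, labels, labeled_idx):
    """Stand-in classifier (assumption): one normalised centroid per author."""
    sums = defaultdict(lambda: defaultdict(float))
    for i in labeled_idx:
        for t, w in vectors[i].items():
            sums[labels[i]][t] += w
    centroids = {}
    for author, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        centroids[author] = {t: w / norm for t, w in vec.items()}
    return centroids

def ranked_scores(centroids, vec):
    """(similarity, author) pairs, most similar author first."""
    return sorted(((dot(c, vec), a) for a, c in centroids.items()), reverse=True)

def predict(centroids, vec):
    return ranked_scores(centroids, vec)[0][1]

def margin(centroids, vec):
    """Gap between the two best authors; a small gap means an uncertain sample."""
    ranked = ranked_scores(centroids, vec)
    return ranked[0][0] - ranked[1][0]

def active_learning(vectors, labels, seed_idx, n_queries):
    """Pool-based loop: refit, query the least certain pool item, add its label.

    The seed set must cover at least two authors so that a margin exists.
    """
    labeled = list(seed_idx)
    pool = [i for i in range(len(vectors)) if i not in labeled]
    for _ in range(n_queries):
        if not pool:
            break
        centroids = fit_centroids(vectors, labels, labeled)
        query = min(pool, key=lambda i: margin(centroids, vectors[i]))
        pool.remove(query)
        labeled.append(query)  # the "oracle" reveals labels[query]
    return fit_centroids(vectors, labels, labeled), labeled

# Hypothetical toy corpus: two authors with distinct vocabularies.
docs = [
    "the senate passed the budget bill today",
    "markets rallied as tech stocks surged",
    "the senate debated the new budget",
    "stocks fell while markets stayed nervous",
    "lawmakers in the senate blocked the bill",
    "tech stocks and bond markets diverged",
]
labels = ["author_a", "author_b", "author_a", "author_b", "author_a", "author_b"]

vectors = tfidf_vectors(docs)
model, labeled = active_learning(vectors, labels, seed_idx=[0, 1], n_queries=2)
print("labeled indices:", labeled)
print("prediction for doc 4:", predict(model, vectors[4]))
```

In a setting closer to the one the abstract describes, the centroid model would presumably be replaced by a probabilistic classifier such as logistic regression, and the margin would be computed from predicted class probabilities rather than cosine similarities; the surrounding loop would stay the same.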

Keywords