IEEE Access (Jan 2019)

An Empirical Study on Forensic Analysis of Urdu Text Using LDA-Based Authorship Attribution

  • Waheed Anwar,
  • Imran Sarwar Bajwa,
  • M. Abbas Choudhary,
  • Shabana Ramzan

DOI
https://doi.org/10.1109/ACCESS.2018.2885011
Journal volume & issue
Vol. 7
pp. 3224 – 3234

Abstract

In recent years, text-based digital forensics has evolved into a major research domain that supports digital investigation. A piece of text can be a critical source of information about its writer, revealed through writing style, typical vocabulary usage, and so on. In this paper, we present a unified approach for intelligent association analysis of text, that is, for measuring how strongly a piece of text is related to a person with respect to his or her stylometric writing features. The latent Dirichlet allocation (LDA)-based approach emphasizes instance-based and profile-based classification of an author’s text. Here, LDA suitably handles high-dimensional and sparse data by allowing a more expressive representation of text. The presented approach is an unsupervised computational methodology that can handle the heterogeneity of the dataset, the diversity in authors’ writing styles, and the inherent ambiguity of Urdu language text. A large corpus was collected for performance testing of the presented approach. The experimental results show the superiority of the proposed approach over state-of-the-art representations and other algorithms used for authorship attribution. The contributions of this paper are threefold: the use of improved sqrt-cosine similarity with LDA topics to measure the similarity between text document vectors for forensic analysis, the construction of a large dataset of 6000 article documents, and the achievement of a 92% F1-measure on articles without using any labels for the authorship attribution task.
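
The abstract does not give the similarity formula, so the following is only a minimal sketch under stated assumptions. It assumes the commonly cited form of improved sqrt-cosine similarity, ISC(x, y) = Σ sqrt(x_i · y_i) / (sqrt(Σ x_i) · sqrt(Σ y_i)), applied to non-negative LDA topic vectors. The function names `improved_sqrt_cosine` and `attribute_author`, the toy topic vectors, and the author labels are hypothetical illustrations of profile-based attribution, not the authors' implementation or data.

```python
import math


def improved_sqrt_cosine(x, y):
    """Improved sqrt-cosine similarity between two non-negative vectors.

    Assumed form: sum(sqrt(x_i * y_i)) / (sqrt(sum(x)) * sqrt(sum(y))).
    For topic distributions that each sum to 1, the denominator is 1.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("vectors must be non-negative")
    numerator = sum(math.sqrt(a * b) for a, b in zip(x, y))
    denominator = math.sqrt(sum(x)) * math.sqrt(sum(y))
    # An all-zero vector carries no topic information; treat it as dissimilar.
    return numerator / denominator if denominator > 0 else 0.0


def attribute_author(doc_topics, author_profiles):
    """Return the author whose topic profile is most similar to the document.

    author_profiles maps an author label to one topic vector per author
    (a hypothetical stand-in for a profile-based representation).
    """
    return max(
        author_profiles,
        key=lambda author: improved_sqrt_cosine(doc_topics, author_profiles[author]),
    )


# Toy topic distributions over three hypothetical LDA topics.
profiles = {
    "author_a": [0.5, 0.4, 0.1],
    "author_b": [0.1, 0.2, 0.7],
}
unknown_doc = [0.6, 0.3, 0.1]
print(attribute_author(unknown_doc, profiles))
```

In this sketch no author labels are used to train a classifier; the unknown document is simply assigned to the nearest profile in topic space, which is one way to read the label-free setting described above.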

Keywords