An Empirical Study on Forensic Analysis of Urdu Text Using LDA-Based Authorship Attribution

Waheed Anwar; Imran Sarwar Bajwa; M. Abbas Choudhary; Shabana Ramzan

doi:10.1109/ACCESS.2018.2885011

IEEE Access (Jan 2019)

Affiliations

Waheed Anwar: Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
Imran Sarwar Bajwa: Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
M. Abbas Choudhary: Dadabhoy Institute of Higher Education, Karachi, Pakistan
Shabana Ramzan: Department of Computer Science, Government Sadiq College Women University, Bahawalpur, Pakistan

DOI: https://doi.org/10.1109/ACCESS.2018.2885011
Journal volume & issue: Vol. 7
pp. 3224 – 3234

Abstract

In recent years, text-based digital forensics has evolved into a major research domain that supports digital investigation. A piece of text can be a critical source of information about its writer, revealed through writing style, characteristic vocabulary, and so on. In this paper, we present a unified approach for intelligent association analysis of text, which measures how strongly a piece of text is related to a person on the basis of that person's stylometric writing features. The latent Dirichlet allocation (LDA)-based approach emphasizes instance-based and profile-based classification of an author's text. Here, LDA suitably handles high-dimensional and sparse data by allowing a more expressive representation of text. The presented approach is an unsupervised computational methodology that can handle the heterogeneity of the dataset, the diversity in authors' writing styles, and the inherent ambiguity of Urdu-language text. A large corpus was collected to test the performance of the presented approach. The experimental results show the superiority of the proposed approach over state-of-the-art representations and other algorithms used for authorship attribution. The contributions of this paper are the use of improved sqrt-cosine similarity with LDA topics to measure the similarity between vectors of text documents for forensic analysis, the construction of a large dataset of 6000 article documents, and the achievement of a 92% F1-measure on articles without using any labels for the authorship attribution task.
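
The similarity step named in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula, so the code assumes the improved sqrt-cosine similarity as it is commonly defined for non-negative vectors: the sum of sqrt(x_i * y_i) divided by sqrt(sum of x_i) times sqrt(sum of y_i). The function names, the author labels, and the three-topic vectors are hypothetical. In practice the vectors would be the per-document topic distributions produced by an LDA model, and nothing here reproduces the authors' actual pipeline.

```python
import math


def improved_sqrt_cosine(x, y):
    """Improved sqrt-cosine similarity of two non-negative vectors.

    Assumed definition: sum(sqrt(x_i * y_i)) / (sqrt(sum(x)) * sqrt(sum(y))).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(math.sqrt(a * b) for a, b in zip(x, y))
    denominator = math.sqrt(sum(x)) * math.sqrt(sum(y))
    return numerator / denominator if denominator > 0 else 0.0


def attribute_profile_based(query, docs_by_author):
    """Profile-based: average each author's topic vectors into one
    profile, then return the author whose profile is most similar."""
    scores = {}
    for author, docs in docs_by_author.items():
        profile = [sum(column) / len(docs) for column in zip(*docs)]
        scores[author] = improved_sqrt_cosine(query, profile)
    return max(scores, key=scores.get)


def attribute_instance_based(query, docs_by_author):
    """Instance-based: compare the query with every known document and
    return the author of the single most similar one."""
    best_author, best_score = None, -1.0
    for author, docs in docs_by_author.items():
        for doc in docs:
            score = improved_sqrt_cosine(query, doc)
            if score > best_score:
                best_author, best_score = author, score
    return best_author


# Hypothetical topic distributions over three LDA topics.
known_docs = {
    "author_a": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "author_b": [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],
}
disputed_doc = [0.65, 0.25, 0.10]

print(attribute_profile_based(disputed_doc, known_docs))
print(attribute_instance_based(disputed_doc, known_docs))
```

Neither function uses author labels to train a classifier. The labels only name the candidates being compared, which matches the unsupervised framing in the abstract.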

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
