An Effective and Scalable Framework for Authorship Attribution Query Processing

Raheem Sarwar; Chenyun Yu; Ninad Tungare; Kanatip Chitavisutthivong; Sukrit Sriratanawilai; Yaohai Xu; Dickson Chow; Thanawin Rakthanmanon; Sarana Nutanong

doi:10.1109/ACCESS.2018.2869198

IEEE Access (Jan 2018)

Affiliations

Raheem Sarwar: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Rayong, Thailand
Chenyun Yu: Department of Computer Science, National University of Singapore, Singapore
Ninad Tungare: Department of Computer Science, City University of Hong Kong, Hong Kong
Kanatip Chitavisutthivong: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Rayong, Thailand
Sukrit Sriratanawilai: Department of Computer Engineering, Kasetsart University, Bangkok, Thailand
Yaohai Xu: Department of Computer Science, City University of Hong Kong, Hong Kong
Dickson Chow: Department of Computer Science, City University of Hong Kong, Hong Kong
Thanawin Rakthanmanon: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Rayong, Thailand
Sarana Nutanong: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Rayong, Thailand

DOI: https://doi.org/10.1109/ACCESS.2018.2869198
Journal volume & issue: Vol. 6
pp. 50030 – 50048

Abstract

Authorship attribution aims to identify the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to involve hundreds of authors, while each author may have only a small number of text samples, e.g., 5–10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios in which the number of candidate authors is limited to 50. These solutions generally report a significant reduction in performance as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document and transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method achieves an accuracy higher than 95%.
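
To make the idea of casting attribution as a similarity search more concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes each document is split into fixed-size word fragments, that each fragment is described by the relative frequencies of a few function words, and that two documents are compared by the median nearest-fragment distance (the median is used here only as one simple way to damp outlier fragments). The feature list, the fragment size, the distance, and all function names are hypothetical choices made for illustration.

```python
import math
import statistics
from collections import Counter

# Hypothetical feature set: a handful of English function words.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "it")


def fragment_vectors(text, fragment_size=50):
    """Split a text into word fragments and return one feature vector per fragment."""
    words = text.lower().split()
    vectors = []
    for start in range(0, len(words), fragment_size):
        chunk = words[start:start + fragment_size]
        counts = Counter(chunk)
        # Relative frequency of each function word within this fragment.
        vectors.append([counts[w] / len(chunk) for w in FUNCTION_WORDS])
    return vectors


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def set_distance(query_vecs, doc_vecs):
    """Median, over query fragments, of the distance to the nearest document fragment.

    Assumes both inputs are non-empty lists of vectors.
    """
    nearest = [min(euclidean(q, d) for d in doc_vecs) for q in query_vecs]
    return statistics.median(nearest)


def attribute(query_text, corpus, k=3, fragment_size=50):
    """Return the majority author among the k documents closest to the query.

    corpus is a list of (author, text) pairs with non-empty texts.
    """
    query_vecs = fragment_vectors(query_text, fragment_size)
    ranked = sorted(
        corpus,
        key=lambda item: set_distance(
            query_vecs, fragment_vectors(item[1], fragment_size)
        ),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

In this sketch, adding a new candidate author only means adding that author's documents to the corpus; nothing is retrained, which is one reason a similarity-search formulation can be attractive when there are many authors with few samples each.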

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
