Statistical Twitter Spam Detection Demystified: Performance, Stability and Scalability

Guanjun Lin; Nan Sun; Surya Nepal; Jun Zhang; Yang Xiang; Houcine Hassan

doi:10.1109/ACCESS.2017.2710540

IEEE Access (Jan 2017)

Affiliations

Guanjun Lin: School of Information Technology, Deakin University, Geelong, VIC, Australia
Nan Sun: School of Information Technology, Deakin University, Geelong, VIC, Australia
Surya Nepal: Data61, CSIRO, Melbourne, VIC, Australia
Jun Zhang: School of Information Technology, Deakin University, Geelong, VIC, Australia
Yang Xiang: School of Information Technology, Deakin University, Melbourne, VIC, Australia
Houcine Hassan: Department of Computer Engineering, Universitat Politècnica de València, Valencia, Spain

DOI: https://doi.org/10.1109/ACCESS.2017.2710540
Journal volume & issue: Vol. 5
pp. 11142 – 11154

Abstract

As the Internet becomes more accessible and devices more mobile, people are spending an increasing amount of time on social networks. The popularity of online social networks has, however, attracted cyber criminals, who spam these platforms in search of potential victims. The spam lures users to external phishing sites or malware downloads, which has become a serious online safety issue and undermines the user experience. Existing solutions nevertheless fail to detect Twitter spam precisely and effectively. In this paper, we compared the performance of a wide range of mainstream machine learning algorithms, aiming to identify those that offer satisfactory detection performance and stability on a large amount of ground truth data. With the goal of achieving real-time Twitter spam detection, we further evaluated the algorithms in terms of scalability. The performance study evaluates the detection accuracy, the true/false positive rate and the F-measure; the stability study examines how stably the algorithms perform when trained on randomly selected samples of different sizes. The scalability study aims to better understand how a parallel computing environment reduces the training/testing time of machine learning algorithms.
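
The performance measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of accuracy, true/false positive rate and F-measure computed from confusion-matrix counts; the function names and the use of a standard deviation as a stability summary are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts.

    tp: spam correctly flagged, fp: non-spam wrongly flagged,
    tn: non-spam correctly passed, fn: spam missed.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate (recall)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    f_measure = 2 * precision * tpr / denom if denom else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "f_measure": f_measure}


def stability_summary(f_measures):
    """Summarise repeated runs on random training samples.

    Assumption: a lower standard deviation across runs is read as
    "more stable"; the paper may define stability differently.
    """
    return {"mean": mean(f_measures), "std": pstdev(f_measures)}
```

For example, `detection_metrics(80, 10, 90, 20)` describes a hypothetical classifier that catches 80 of 100 spam tweets while wrongly flagging 10 of 100 legitimate ones.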

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
