IEEE Access (Jan 2020)
Efficient Clustering of Emails Into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework
Abstract
The spread and adoption of spam emails in malicious activities like information and identity theft, malware propagation, monetary and reputational damage etc. are on the rise with increased effectiveness and diversification. Without doubt these criminal acts endanger the privacy of many users and businesses'. Several research initiatives have taken place to address the issue with no complete solution until now; and we believe an intelligent and automated methodology should be the way forward to tackle the challenges. However, till date limited studies have been conducted on the applications of purely unsupervised frameworks and algorithms in tackling the problem. To explore and investigate the possibilities, we intend to propose an anti-spam framework that fully relies on unsupervised methodologies through a multi-algorithm clustering approach. This article presents an in-depth analysis on the methodologies of the first component of the framework, examining only the domain and header related information found in email headers. A novel method of feature reduction using an ensemble of `unsupervised' feature selection algorithms has also been investigated in this study. In addition, a comprehensive novel dataset of 100,000 records of ham and spam emails has been developed and used as the data source. Key findings are summarized as follows: I) out of six different clustering algorithms used - Spectral and K-means demonstrated acceptable performance while OPTICS projected the optimum clustering with an average of 3.5% better efficiency than Spectral and K-means, validated through a range of validations processes II) The other three algorithms- BIRCH, HDBSCAN and K-modes, did not fare well enough. III) The average balanced accuracy for the optimum three algorithms has been found to be ≈94.91%, and IV) The proposed feature reduction framework achieved its goal with high confidence.
Keywords