A semantic-based model with a hybrid feature engineering process for accurate spam detection

Chira N. Mohammed; Ayah M. Ahmed

doi:10.1186/s43067-024-00151-3

Journal of Electrical Systems and Information Technology (Jul 2024)

Affiliations

Chira N. Mohammed: Department of Computer Science, University of Zakho
Ayah M. Ahmed: Department of Computer Science, University of Zakho

DOI: https://doi.org/10.1186/s43067-024-00151-3
Journal volume & issue: Vol. 11, no. 1, pp. 1–16

Abstract

Detecting spam emails is essential to maintaining the security and integrity of email communication. Existing research has made significant progress in developing effective spam detection models, but challenges remain in improving classification performance and in adapting to evolving spamming techniques. In this study, we propose a novel spam detection model with a comprehensive feature engineering approach that combines term frequency-inverse document frequency (TF-IDF) vectorizer features with word embedding features to optimize the feature space. Our contribution lies in integrating semantic-based word embeddings, which leverage pre-existing knowledge to capture the semantic meaning of words and enhance the representation of email texts. To identify the most suitable word embedding technique for our model, we evaluated GloVe, Word2Vec, and FastText. GloVe was selected for its better performance, which we attribute to its pre-training on a large and diverse text corpus. Furthermore, when evaluated without word embeddings, the model was less effective than the word embedding-based version. Additionally, we used a support vector machine as the classifier and applied hyperparameter tuning to identify the model's most effective parameter values. The proposed model was tested on two datasets. The experimental results showed that our model outperformed the other models discussed in the literature, achieving an accuracy of 99.5% on the SpamAssassin dataset and 99.28% on the Enron-Spam dataset.
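
The hybrid feature space described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the TF-IDF and embedding features are combined, so simple concatenation of a TF-IDF vector with an averaged word vector is assumed here. The embedding table `TOY_EMBEDDINGS` is a hypothetical three-dimensional stand-in for pretrained GloVe vectors, the function names are illustrative, and the support vector machine and hyperparameter tuning steps are left out.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional vectors standing in for pretrained GloVe embeddings.
TOY_EMBEDDINGS = {
    "free": [0.9, 0.1, 0.0],
    "prize": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.9, 0.3],
    "tomorrow": [0.1, 0.8, 0.4],
}
EMBEDDING_DIM = 3


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(documents):
    """Return (vocabulary, idf) using the smoothed IDF ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    vocabulary = sorted(df)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}
    return vocabulary, idf


def tfidf_vector(text, vocabulary, idf):
    """L2-normalized TF-IDF vector over the fitted vocabulary."""
    counts = Counter(tokenize(text))
    vec = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def mean_embedding(text):
    """Average the vectors of known tokens; zeros if no token is known."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokenize(text) if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hybrid_features(text, vocabulary, idf):
    """Assumed combination: TF-IDF vector concatenated with the mean embedding."""
    return tfidf_vector(text, vocabulary, idf) + mean_embedding(text)


emails = ["free prize free", "meeting tomorrow"]
vocabulary, idf = fit_idf(emails)
features = [hybrid_features(email, vocabulary, idf) for email in emails]
```

Each row of `features` has one entry per vocabulary term followed by `EMBEDDING_DIM` embedding entries. In a fuller pipeline these rows would be passed to a classifier such as a support vector machine, with its parameters chosen by a search over candidate values.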

Published in Journal of Electrical Systems and Information Technology

ISSN: 2314-7172 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering; Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://jesit.springeropen.com/
