IEEE Access (Jan 2024)
Advanced Analysis of Learning-Based Spam Email Filtering Methods Based on Feature Distribution Differences of Dataset
Abstract
Spam emails, which are unsolicited bulk emails, pose a significant threat in digital communication security. To counter spam emails, learning-based spam email filtering methods have been extensively studied. However, as spam patterns evolve, these methods face challenges in maintaining the accuracy of models trained on outdated patterns. To demonstrate these limitations empirically and gain insight into the classification patterns of spam email filtering models, we propose an advanced analysis method to analyze the performance degradation of spam email filtering models. The proposed analysis method involves text preprocessing, embedding model training, spam email filtering model training, evaluation, and analysis of the classification patterns of the learning-based spam email filtering models. From the experimental results under various datasets and spam email filtering models, we show that the accuracy of spam email filtering models significantly decreases when the feature distribution of the test dataset is different from the training dataset. We also provides valuable insights for improving the model architecture, dataset structure, and training strategies by analysis of various factors such as confusion matrix, performance metrics, mean sequence length, out-of-vocabulary (OOV) rate, and top-20 tokens.
Keywords