IEEE Access (Jan 2020)

The Answer is in the Text: Multi-Stage Methods for Phishing Detection Based on Feature Engineering

  • Eder Souza Gualberto,
  • Rafael Timoteo De Sousa,
  • Thiago Pereira De Brito Vieira,
  • Joao Paulo Carvalho Lustosa Da Costa,
  • Claudio Gottschalg Duque

DOI
https://doi.org/10.1109/ACCESS.2020.3043396
Journal volume & issue
Vol. 8
pp. 223529 – 223547

Abstract

A phishing attack is a threat based on fraudulent communication, usually by e-mail, in which cybercriminals impersonate a trusted person or organization to lure and coax a target. Detection approaches that obtain highly representational features from the text of these e-mails are a suitable countermeasure, since the features can be used to train machine learning algorithms and produce models that classify mail samples as phishing or legitimate. This paper proposes a multi-stage approach to detect phishing e-mail attacks using natural language processing and machine learning. The approach consists of feature engineering within natural language processing, lemmatization, feature selection, feature extraction, improved learning techniques for resampling and cross-validation, and hyperparameter configuration. We present two methods based on this approach: the first exploits Chi-Square statistics and Mutual Information to improve dimensionality reduction, while the second combines Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). Both methods address the “curse of dimensionality”, sparsity, and the amount of information that must be obtained from context in the Vector Space Model (VSM) representation. They yield reduced feature sets that, combined with the XGBoost and Random Forest machine learning algorithms, achieve an F1-measure of 100% in validation tests with the SpamAssassin Public Corpus and the Nazario Phishing Corpus datasets. Even when considering only the text in e-mail bodies, the proposed approach outperforms state-of-the-art schemes on an accredited data set while requiring far fewer features and incurring lower computational cost.
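
To make the Chi-Square feature selection step mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the four-message corpus, the whitespace tokenizer, and the function names `chi_square_scores` and `select_top_k` are assumptions made for illustration. It scores each term with the classic chi-square statistic of its 2x2 term-presence by class table and keeps the highest-scoring terms as a reduced feature set.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by the chi-square statistic of its 2x2 presence/class table.

    labels use 1 for phishing and 0 for legitimate (an illustrative convention).
    """
    n = len(docs)
    n_pos = sum(labels)          # number of phishing documents
    n_neg = n - n_pos            # number of legitimate documents
    df_pos, df_neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        terms = set(text.lower().split())   # term presence, not frequency
        (df_pos if label == 1 else df_neg).update(terms)

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]         # phishing docs containing the term
        b = df_neg[term]         # legitimate docs containing the term
        c = n_pos - a            # phishing docs without the term
        d = n_neg - b            # legitimate docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term carries no class information.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not taken from SpamAssassin or the Nazario corpus.
docs = [
    "verify your account password now",
    "urgent verify your bank account",
    "meeting agenda for monday",
    "project meeting notes attached",
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, 4)
print(selected)
```

In this toy setting, terms that appear in every message of one class and in none of the other receive the highest score, so they are the ones retained. A real pipeline would apply lemmatization first and would typically rely on a library implementation operating on a weighted document-term matrix, which may compute the statistic differently from this presence-based version.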

Keywords