A Phishing-Attack-Detection Model Using Natural Language Processing and Deep Learning

Eduardo Benavides-Astudillo; Walter Fuertes; Sandra Sanchez-Gordon; Daniel Nuñez-Agurto; Germán Rodríguez-Galán

doi:10.3390/app13095275

Applied Sciences (Apr 2023)

Affiliations

Eduardo Benavides-Astudillo: Department of Informatics and Computer Science, Escuela Politécnica Nacional, Quito 170525, Ecuador
Walter Fuertes: Department of Computer Sciences, Universidad de las Fuerzas Armadas ESPE, Sangolquí 171103, Ecuador
Sandra Sanchez-Gordon: Department of Informatics and Computer Science, Escuela Politécnica Nacional, Quito 170525, Ecuador
Daniel Nuñez-Agurto: Department of Computer Sciences, Universidad de las Fuerzas Armadas ESPE, Sangolquí 171103, Ecuador
Germán Rodríguez-Galán: Department of Computer Sciences, Universidad de las Fuerzas Armadas ESPE, Sangolquí 171103, Ecuador

DOI: https://doi.org/10.3390/app13095275
Journal volume & issue: Vol. 13, no. 9, p. 5275

Abstract

Phishing is a type of cyber-attack that aims to deceive users, usually through fraudulent web pages that appear legitimate. Currently, one of the most common ways to detect phishing pages from their content is to feed words into Deep Learning (DL) algorithms non-sequentially, i.e., without regard to the order in which the words appear. However, this approach loses the intrinsic richness of the relationships between words. The innovation of this study in the field of cyber-security is a model that detects phishing attacks based on the text of suspicious web pages rather than on their URLs, using Natural Language Processing (NLP) and DL algorithms. We used the Keras Embedding Layer with Global Vectors for Word Representation (GloVe) to exploit the semantic and syntactic features of the web page content. We first analyzed the text using NLP and word embedding, and then fed the resulting data into a DL algorithm. To assess which DL algorithm works best, we evaluated four alternatives: Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional GRU (BiGRU). The proposed model is promising: the mean accuracy achieved by each of the four DL algorithms was at least 96.7%, and the best performer was BiGRU with 97.39%.
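The contrast the abstract draws between order-free and order-preserving input can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`build_vocab`, `bag_of_words`, `to_sequence`, `embed`), the two sample page texts, and the toy embedding table are all assumptions made for illustration. In the paper's setting the table would hold pretrained GloVe vectors and the lookup would be performed by the Keras Embedding Layer.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Order-free representation: one count per vocabulary entry."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocab]


def to_sequence(text, vocab, max_len):
    """Order-preserving representation: token ids, truncated or padded."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def embed(sequence, table):
    """Look up one vector per token id, as an embedding layer would."""
    return [table[i] for i in sequence]


# Two hypothetical page texts containing the same words in a different order.
page_a = "verify your account before login"
page_b = "login before your account verify"

vocab = build_vocab([page_a, page_b])

bow_a = bag_of_words(page_a, vocab)
bow_b = bag_of_words(page_b, vocab)

seq_a = to_sequence(page_a, vocab, max_len=8)
seq_b = to_sequence(page_b, vocab, max_len=8)

# Toy 2-dimensional vectors standing in for pretrained GloVe vectors
# (assumption: real vectors would be loaded from a GloVe file).
table = [[float(i), 0.5 * float(i)] for i in range(len(vocab))]
embedded_a = embed(seq_a, table)

print("same bag of words:", bow_a == bow_b)
print("same sequence:", seq_a == seq_b)
```

The two pages are indistinguishable as bags of words but differ as sequences, which is the information a recurrent model such as LSTM or GRU can use once each id has been mapped to its embedding vector.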

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
