IEEE Access (Jan 2022)

Phishing URL Detection: A Real-Case Scenario Through Login URLs

  • Manuel Sanchez-Paniagua,
  • Eduardo Fidalgo Fernandez,
  • Enrique Alegre,
  • Wesam Al-Nabki,
  • Victor Gonzalez-Castro

DOI
https://doi.org/10.1109/ACCESS.2022.3168681
Journal volume & issue
Vol. 10
pp. 42949 – 42960

Abstract

Phishing is a social engineering cyberattack in which criminals deceive users to obtain their credentials through a login form that submits the data to a malicious server. In this paper, we compare machine learning and deep learning techniques to present a method capable of detecting phishing websites through URL analysis. In most current state-of-the-art phishing detection solutions, the legitimate class consists of homepages that do not include login forms. In contrast, we use login-page URLs in both classes, because we consider this much more representative of a real-case scenario, and we demonstrate that existing techniques obtain a high false-positive rate when tested with URLs from legitimate login pages. Additionally, we train a base model on older datasets and test it on recent URLs to show how model accuracy decreases over time. We also perform a frequency analysis of current phishing domains to identify the different techniques phishers use in their campaigns. To support these claims, we have created a new dataset named Phishing Index Login URL (PILU-90K), which is composed of 60K legitimate URLs, including index and login websites, and 30K phishing URLs. Finally, we present a Logistic Regression model which, combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, obtains 96.50% accuracy on the introduced login URL dataset.
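
To make the final step of the abstract concrete, the following is a minimal sketch of a URL classifier that combines TF-IDF features with logistic regression, written from scratch with the Python standard library. It is not the authors' implementation. The abstract does not state the tokenization or hyperparameters, so the character 3-grams, the smoothed IDF formula, the learning rate, and the epoch count are all assumptions, and the six training URLs are made-up placeholders rather than PILU-90K samples.

```python
import math
from collections import Counter

def char_ngrams(url, n=3):
    """Split a URL into overlapping character n-grams (assumed tokenization)."""
    url = url.lower()
    return [url[i:i + n] for i in range(len(url) - n + 1)]

def fit_idf(urls):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    doc_freq = Counter()
    for url in urls:
        doc_freq.update(set(char_ngrams(url)))
    total = len(urls)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}

def tfidf(url, idf):
    """L2-normalized sparse TF-IDF vector (a dict) for one URL."""
    counts = Counter(g for g in char_ngrams(url) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(vec, weights, bias):
    """Linear score of a sparse vector under the current model."""
    return bias + sum(weights.get(g, 0.0) * v for g, v in vec.items())

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = sigmoid(score(vec, weights, bias)) - y  # gradient of log loss
            for g, v in vec.items():
                weights[g] = weights.get(g, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias

def predict_proba(url, idf, weights, bias):
    """Estimated probability that a URL is phishing (label 1)."""
    return sigmoid(score(tfidf(url, idf), weights, bias))

# Hypothetical toy data: 0 = legitimate login URL, 1 = phishing URL.
train_urls = [
    ("https://www.example.com/login", 0),
    ("https://accounts.example.org/signin", 0),
    ("https://shop.example.net/account/login", 0),
    ("http://example-secure-login.verify-account.test/update.php", 1),
    ("http://paypa1-verify.account-update.test/login.php", 1),
    ("http://secure-update.verify-billing.test/confirm.php", 1),
]

idf = fit_idf([u for u, _ in train_urls])
vectors = [tfidf(u, idf) for u, _ in train_urls]
labels = [y for _, y in train_urls]
weights, bias = train_logreg(vectors, labels)

for url, label in train_urls:
    print(label, round(predict_proba(url, idf, weights, bias), 3), url)
```

Note that the legitimate examples here are login pages, in line with the abstract's argument that both classes should contain login URLs. In practice, a library such as scikit-learn (TfidfVectorizer with LogisticRegression) would typically replace the hand-written parts, and a toy set this small says nothing about the accuracy reported in the paper.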

Keywords