Phishing URL Detection: A Real-Case Scenario Through Login URLs

Manuel Sanchez-Paniagua; Eduardo Fidalgo Fernandez; Enrique Alegre; Wesam Al-Nabki; Victor Gonzalez-Castro

doi:10.1109/access.2022.3168681

IEEE Access (Jan 2022)

Affiliations

Manuel Sanchez-Paniagua: Department of Electrical, Systems and Automation Engineering, Universidad de León, León, Spain
Eduardo Fidalgo Fernandez: Department of Electrical, Systems and Automation Engineering, Universidad de León, León, Spain
Enrique Alegre: Department of Electrical, Systems and Automation Engineering, Universidad de León, León, Spain
Wesam Al-Nabki: Department of Electrical, Systems and Automation Engineering, Universidad de León, León, Spain
Victor Gonzalez-Castro: Department of Electrical, Systems and Automation Engineering, Universidad de León, León, Spain

Journal volume & issue: Vol. 10
pp. 42949 – 42960

Abstract

Phishing is a social engineering cyberattack in which criminals deceive users into surrendering their credentials through a login form that submits the data to a malicious server. In this paper, we compare machine learning and deep learning techniques and present a method capable of detecting phishing websites through URL analysis. In most current state-of-the-art phishing detection solutions, the legitimate class is made up of homepages that do not include login forms. In contrast, we use login-page URLs in both classes because we consider this much more representative of a real-case scenario, and we demonstrate that existing techniques yield a high false-positive rate when tested with URLs from legitimate login pages. Additionally, we use datasets from different years to show how model accuracy decreases over time, by training a base model on old datasets and testing it on recent URLs. We also perform a frequency analysis of current phishing domains to identify the techniques phishers use in their campaigns. To support these claims, we created a new dataset named Phishing Index Login URL (PILU-90K), composed of 60K legitimate URLs, including index and login pages, and 30K phishing URLs. Finally, we present a Logistic Regression model that, combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, obtains 96.50% accuracy on the introduced login URL dataset.
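
The pairing of TF-IDF features with Logistic Regression named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how URLs are tokenized, so character trigrams are an assumption here, as are the class name `TfidfLogisticRegression`, the learning rate, the epoch count, and the handful of made-up URLs on reserved example domains that stand in for PILU-90K. The code uses only the Python standard library so that it stays self-contained; a real experiment would more likely rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n=3):
    """Return the overlapping character n-grams of a lowercased URL."""
    url = url.lower()
    return [url[i:i + n] for i in range(len(url) - n + 1)]


class TfidfLogisticRegression:
    """Toy TF-IDF + logistic regression URL classifier (illustrative only)."""

    def __init__(self, n=3, learning_rate=1.0, epochs=500):
        self.n = n                      # n-gram size (assumed, not from the paper)
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.idf = {}
        self.weights = {}
        self.bias = 0.0

    def _vectorize(self, url):
        # TF-IDF weights for n-grams seen in training, L2-normalised.
        counts = Counter(char_ngrams(url, self.n))
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {t: v / norm for t, v in vec.items()}

    def _score(self, vec):
        # Sigmoid of the linear score: estimated probability of phishing.
        z = self.bias + sum(self.weights.get(t, 0.0) * v for t, v in vec.items())
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, urls, labels):
        # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
        n_docs = len(urls)
        doc_freq = Counter(t for u in urls for t in set(char_ngrams(u, self.n)))
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        vectors = [self._vectorize(u) for u in urls]
        self.weights = dict.fromkeys(self.idf, 0.0)
        self.bias = 0.0
        # Plain batch gradient descent on the mean log-loss.
        for _ in range(self.epochs):
            grad_w = defaultdict(float)
            grad_b = 0.0
            for vec, y in zip(vectors, labels):
                error = self._score(vec) - y
                grad_b += error
                for t, v in vec.items():
                    grad_w[t] += error * v
            step = self.learning_rate / n_docs
            self.bias -= step * grad_b
            for t, g in grad_w.items():
                self.weights[t] -= step * g
        return self

    def predict_proba(self, url):
        """Estimated probability that the URL is phishing."""
        return self._score(self._vectorize(url))


# Hypothetical training URLs; label 0 = legitimate, 1 = phishing.
train_urls = [
    "https://www.example.com/login",
    "https://accounts.example.org/signin",
    "https://shop.example.net/",
    "https://www.example.edu/index.html",
    "http://example-com.secure-verify.test/login.php",
    "http://198.51.100.7/paypa1/update/signin.html",
    "http://account-verify.example.test/webscr/login.php?cmd=1",
    "http://secure-update.test/bank/verify.php",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = TfidfLogisticRegression().fit(train_urls, train_labels)
for url in ("https://www.example.org/login",
            "http://secure-verify.test/update/login.php"):
    print(f"{model.predict_proba(url):.2f}  {url}")
```

Note that the legitimate class in this toy set deliberately includes login pages, mirroring the abstract's point that a detector trained only on legitimate homepages may flag any URL containing login-related terms.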

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
