A Framework for Preparing a Balanced and Comprehensive Phishing Dataset

Ivan Skula; Michal Kvet

doi:10.1109/access.2024.3387437

IEEE Access (Jan 2024)

Affiliations

Ivan Skula: ORCiD; Department of Informatics, University of Žilina, Žilina, Slovakia
Michal Kvet: ORCiD; Department of Informatics, University of Žilina, Žilina, Slovakia

DOI: https://doi.org/10.1109/access.2024.3387437
Journal volume & issue: Vol. 12
pp. 53610 – 53622

Abstract

People commonly face phishing attempts every day, usually through an email containing a malicious URL that points to a phishing landing page. In recent years, numerous studies have applied machine-learning techniques to detect phishing webpages. These techniques require real-world data, from which they extract distinctive underlying patterns that are not easily visible to humans, so capturing and collating such data plays a fundamental role in the overall process. Supervised machine-learning algorithms rely on accurate and balanced data for training. Despite the proliferation of research in this field, comparing studies remains difficult because the data sources, transformations, and data-cleansing techniques used to prepare the training dataset vary. This paper presents a framework for creating a comprehensive and balanced dataset for training machine-learning models that detect phishing webpages. The framework covers the identification and gathering of phishing and legitimate data as well as data cleansing, and it highlights important considerations related to the structural composition of the final dataset, such as the ratio between phishing and legitimate records and the optimal dataset size. Although there is no universal way of preparing a balanced and efficient dataset, the proposed framework provides comprehensive guidelines for constructing one and addresses aspects specific to phishing detection. The practical benefits of applying the framework are accurate, non-skewed, and balanced data, which lead to an accurate model, and transparency of data transformation, which enables results to be compared between studies.
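To make the dataset-composition considerations above concrete, the following is a minimal Python sketch of one possible cleansing and balancing step. It is not the authors' framework: the function names `normalize` and `build_balanced`, the normalization rule, the choice to treat a URL found in both sources as phishing, and the default 1:1 ratio are all assumptions made for illustration.

```python
import random
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the host and drop the fragment (an assumed cleansing rule)."""
    parts = urlsplit(url.strip())
    return parts._replace(netloc=parts.netloc.lower(), fragment="").geturl()


def build_balanced(phishing, legitimate, ratio=1.0, seed=0):
    """Return shuffled (url, label) pairs; label 1 = phishing, 0 = legitimate.

    `ratio` is the assumed number of legitimate records per phishing record.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    # Deduplicate after normalization so near-identical URLs count once.
    phish = sorted({normalize(u) for u in phishing})
    # Assumption: a URL present in both sources is kept as phishing only.
    legit = sorted({normalize(u) for u in legitimate} - set(phish))

    # Downsample whichever class is too large for the requested ratio.
    n_phish = min(len(phish), int(len(legit) / ratio))
    n_legit = min(len(legit), round(n_phish * ratio))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rows = [(u, 1) for u in rng.sample(phish, n_phish)]
    rows += [(u, 0) for u in rng.sample(legit, n_legit)]
    rng.shuffle(rows)
    return rows
```

Recording the seed, the ratio, and the normalization rule alongside the resulting dataset is one way to keep the data transformation transparent, which is the property the abstract links to comparability between studies.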

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
