IEEE Access (Jan 2024)
A Framework for Preparing a Balanced and Comprehensive Phishing Dataset
Abstract
It is not uncommon for people to face phishing attempts on a daily basis, usually via an email containing a malicious URL that points to a phishing landing page. In recent years, numerous studies have used machine learning techniques to detect phishing webpages. These techniques require real-world data from which they extract underlying distinctive patterns that are not easily visible to humans. Capturing and collating such data therefore plays a fundamental role in the overall process. Supervised machine learning algorithms rely on accurate and balanced data for training. Despite the proliferation of research in this field, comparing different studies remains a common challenge because data sources, transformations, and data cleansing techniques vary when the training dataset is prepared. This paper presents a framework for creating a comprehensive and balanced dataset for training machine learning models that detect phishing webpages. The framework covers the process of identifying and gathering both phishing and legitimate data, describes data cleansing, and highlights important considerations related to the structural composition of the final dataset, such as the ratio between phishing and legitimate records and the optimal dataset size. Though there is no universal way of preparing a balanced and efficient dataset, the proposed framework provides comprehensive guidelines for constructing one, addressing aspects specific to phishing detection. The practical benefits of applying the framework are accurate, non-skewed, and balanced data, which lead to an accurate model, and transparent data transformations, which enable the results of different studies to be compared.
Keywords