Phishing Website Detection Based on Deep Convolutional Neural Network and Random Forest Ensemble Learning

Rundong Yang; Kangfeng Zheng; Bin Wu; Chunhua Wu; Xiujuan Wang

doi:10.3390/s21248281

Sensors (Dec 2021)

Affiliations

Rundong Yang: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
Kangfeng Zheng: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
Bin Wu: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
Chunhua Wu: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
Xiujuan Wang: School of Computer Science, Beijing University of Technology, Beijing 100124, China

DOI: https://doi.org/10.3390/s21248281
Journal volume & issue: Vol. 21, no. 24, p. 8281

Abstract

Phishing has become one of the biggest and most effective cyber threats, causing hundreds of millions of dollars in losses and millions of data breaches every year. Current anti-phishing techniques rely on experts to extract features of phishing sites and on third-party services to detect them. These techniques have limitations. First, extracting phishing features requires expertise and is time-consuming. Second, relying on third-party services delays the detection of phishing sites. This paper therefore proposes an integrated phishing website detection method based on convolutional neural networks (CNN) and random forests (RF). The method predicts the legitimacy of a URL without accessing the web content or using third-party services. It uses character embedding to convert URLs into fixed-size matrices, extracts features at different levels using CNN models, classifies the multi-level features using multiple RF classifiers, and finally outputs the prediction using a winner-take-all approach. The proposed model achieved an accuracy of 99.35% on our dataset and 99.26% on the benchmark data, much higher than that of the existing extreme model.
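
The first and last stages of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The alphabet, the fixed length of 200 and the function names encode_url and winner_take_all are assumptions made for illustration, and the winner-take-all rule shown here (take the label from the single most confident classifier output) is only one plausible reading of the abstract. The embedding lookup, the CNN feature extractors and the RF classifiers are omitted.

```python
import string

# Hypothetical alphabet: the abstract does not state the character set used.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
# Id 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 200  # assumed fixed URL length, not taken from the paper


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of character ids.

    Longer URLs are truncated and shorter ones are zero-padded, so every
    URL yields the same input size. An embedding layer would then turn
    each id into a vector, giving the fixed-size matrix fed to the CNN.
    """
    ids = [CHAR_TO_ID.get(ch, 0) for ch in url.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def winner_take_all(probabilities):
    """Return the label backed by the single most confident classifier.

    `probabilities` holds one [p_legitimate, p_phishing] pair per
    classifier (for example, one RF per CNN feature level).
    Label 0 means legitimate and label 1 means phishing.
    """
    best_label, best_p = 0, -1.0
    for p_legit, p_phish in probabilities:
        for label, p in ((0, p_legit), (1, p_phish)):
            if p > best_p:
                best_label, best_p = label, p
    return best_label
```

For example, winner_take_all([[0.6, 0.4], [0.1, 0.9], [0.55, 0.45]]) selects the phishing label, because the second classifier's 0.9 is the highest confidence among all outputs. A majority vote over the same inputs would instead select the legitimate label, which is why the choice of combination rule matters.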

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
